Efficient, robust and effective rank aggregation for massive biological datasets
Future Generation Computer Systems (IF 6.2), Pub Date: 2021-06-09, DOI: 10.1016/j.future.2021.06.013
Pierre Andrieu, Bryan Brancotte, Laurent Bulteau, Sarah Cohen-Boulakia, Alain Denise, Adeline Pierrot, Stéphane Vialette

Massive biological datasets are available from various sources. To answer a biological question (e.g., “which genes are involved in a given disease?”), life scientists query and mine such datasets using various techniques. Each technique provides a list of results usually ranked by importance (e.g., a ranked list of genes). Combining the results obtained by various techniques, that is, combining ranked lists of elements into one list, is of paramount importance to help life scientists make the most of these results and prioritize further investigations. Rank aggregation techniques are particularly well suited to this context as they take in a set of rankings and provide a consensus, that is, a single ranking which is the “closest” to the input rankings. However, (i) the problem of rank aggregation is NP-hard in most cases (using an exact algorithm is currently not possible for more than a few dozen elements) and (ii) several (possibly very different) exact solutions can be obtained. As an answer to (i), many heuristics and approximation algorithms have been proposed. However, heuristics cannot guarantee how far from an exact solution the consensus ranking will be, and the approximation ratio of approximation algorithms dedicated to the problem is fairly high (not less than 3/2). No solution has yet been proposed to help end users deal with the problem encountered in point (ii).
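
To make points (i) and (ii) concrete, here is a minimal sketch of exact rank aggregation by brute force. It assumes strict rankings without ties over the same elements and uses the classic Kendall-tau distance as the notion of "closest"; the abstract does not state which distance the paper uses, so treat both as assumptions. The function names and the toy element labels are hypothetical. Enumerating all permutations is only feasible for a handful of elements, which is the practical face of the NP-hardness mentioned above.

```python
from itertools import combinations, permutations

def kendall_tau(r1, r2):
    """Count the element pairs that two strict rankings order differently."""
    pos1 = {e: i for i, e in enumerate(r1)}
    pos2 = {e: i for i, e in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
    )

def optimal_consensuses(rankings):
    """Return the minimum total distance and every ranking that reaches it.

    Brute force over all n! permutations: illustration only.
    """
    elements = sorted(rankings[0])
    scored = [
        (sum(kendall_tau(p, r) for r in rankings), p)
        for p in permutations(elements)
    ]
    best = min(score for score, _ in scored)
    return best, [list(p) for score, p in scored if score == best]

# Two toy input rankings that disagree only on the order of A and B.
rankings = [
    ["A", "B", "C", "D"],
    ["B", "A", "C", "D"],
]
best, optima = optimal_consensuses(rankings)
print(best)    # 1
print(optima)  # [['A', 'B', 'C', 'D'], ['B', 'A', 'C', 'D']]
```

Even on this tiny input, two different rankings are equally optimal, which is the situation described in point (ii).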

In this paper we present a complete system able to perform rank aggregation of massive biological datasets. Our solution is efficient as it is based on an original partitioning method making it possible to compute a high-quality consensus using an exact algorithm in a large number of cases. Our solution is robust as it clearly identifies, for the user, groups of elements whose relative order is the same in any optimal solution. These features provide answers to points (i) and (ii) and rest on mathematical foundations offering guarantees on the computed result. Also, our solution is effective as it has been implemented in a real tool, ConquR-BioV2, which is used by the life science community, and evaluated at large scale using a very large number of datasets.
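
The robustness idea can be illustrated on a toy input. The sketch below is not the paper's partitioning method, which the abstract does not describe. It simply enumerates every optimal consensus by brute force (assuming strict rankings and the classic Kendall-tau distance) and then reports the pairs of elements whose relative order is identical in all of them. All names are hypothetical.

```python
from itertools import combinations, permutations

def disagreements(r1, r2):
    """Count the element pairs that two strict rankings order differently."""
    pos1 = {e: i for i, e in enumerate(r1)}
    pos2 = {e: i for i, e in enumerate(r2)}
    return sum(
        (pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
        for a, b in combinations(r1, 2)
    )

def all_optima(rankings):
    """Every permutation with minimum total distance to the inputs (brute force)."""
    elements = sorted(rankings[0])
    cost = {
        p: sum(disagreements(p, r) for r in rankings)
        for p in permutations(elements)
    }
    best = min(cost.values())
    return [p for p, c in cost.items() if c == best]

def stable_pairs(optima):
    """Pairs (a, b) such that a is ranked before b in every optimal consensus."""
    elements = sorted(optima[0])
    return [
        (a, b)
        for a, b in permutations(elements, 2)
        if all(o.index(a) < o.index(b) for o in optima)
    ]

# Toy input: the rankings agree that {A, B} come before {C, D},
# but are evenly split on the order inside each group.
rankings = [
    ["A", "B", "C", "D"],
    ["B", "A", "C", "D"],
    ["A", "B", "D", "C"],
    ["B", "A", "D", "C"],
]
optima = all_optima(rankings)
print(len(optima))           # 4
print(stable_pairs(optima))  # [('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]
```

Here the order between the two groups is the same in all four optimal consensuses, while the order inside each group is not, so only the between-group order could be reported to a user as reliable.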




Updated: 2021-06-22