Learning sparse log-ratios for high-throughput sequencing data
Bioinformatics (IF 5.8), Pub Date: 2021-09-08, DOI: 10.1093/bioinformatics/btab645
Elliott Gordon-Rodriguez, Thomas P Quinn, John P Cunningham

Motivation: The automatic discovery of sparse biomarkers that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput sequencing (HTS) data, and compositional data (CoDa) more generally, an important class of biomarkers are the log-ratios between the input variables. However, identifying predictive log-ratio biomarkers from HTS data is a combinatorial optimization problem, which is computationally challenging. Existing methods are slow to run and scale poorly with the dimension of the input, which has limited their application to low- and moderate-dimensional metagenomic datasets.

Results: Building on recent advances from the field of deep learning, we present CoDaCoRe, a novel learning algorithm that identifies sparse, interpretable and predictive log-ratio biomarkers. Our algorithm exploits a continuous relaxation to approximate the underlying combinatorial optimization problem. This relaxation can then be optimized efficiently using the modern ML toolbox, in particular, gradient descent. As a result, CoDaCoRe runs several orders of magnitude faster than competing methods, all while achieving state-of-the-art performance in terms of predictive accuracy and sparsity. We verify the outperformance of CoDaCoRe across a wide range of microbiome, metabolite and microRNA benchmark datasets, as well as a particularly high-dimensional dataset that is outright computationally intractable for existing sparse log-ratio selection methods.

Availability and implementation: The CoDaCoRe package is available at https://github.com/egr95/R-codacore. Code and instructions for reproducing our results are available at https://github.com/cunningham-lab/codacore.

Supplementary information: Supplementary data are available at Bioinformatics online.
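
To illustrate the general idea of replacing a combinatorial search over log-ratios with a continuous relaxation that gradient descent can optimize, here is a minimal sketch in plain Python. It is not the CoDaCoRe algorithm and does not use the CoDaCoRe package; the softmax-based relaxation, the function names (`softmax`, `fit_soft_log_ratio`), the fixed unit slope, the hyperparameters and the synthetic data are all assumptions made for illustration. The sketch learns a soft choice of numerator and denominator variables for a single pairwise log-ratio, then discretizes it by keeping the most heavily weighted variable on each side.

```python
import math
import random


def softmax(v):
    """Numerically stable softmax of a list of floats."""
    m = max(v)
    e = [math.exp(t - m) for t in v]
    s = sum(e)
    return [t / s for t in e]


def fit_soft_log_ratio(X, y, steps=300, lr=0.5):
    """Fit a relaxed pairwise log-ratio by gradient descent on the logistic loss.

    X: rows of strictly positive values; y: 0/1 labels (illustrative setup).
    Returns (numerator_index, denominator_index) after discretization.
    """
    n, d = len(X), len(X[0])
    L = [[math.log(v) for v in row] for row in X]  # log-transform once
    a = [0.0] * d  # logits of the soft numerator selection
    b = [0.0] * d  # logits of the soft denominator selection
    c = 0.0        # intercept
    for _ in range(steps):
        p, q = softmax(a), softmax(b)
        ga, gb, gc = [0.0] * d, [0.0] * d, 0.0
        for row, label in zip(L, y):
            num = sum(pj * lj for pj, lj in zip(p, row))  # soft log-numerator
            den = sum(qj * lj for qj, lj in zip(q, row))  # soft log-denominator
            z = num - den + c  # weights sum to zero, so z ignores row scaling
            err = (1.0 / (1.0 + math.exp(-z)) - label) / n  # d(loss)/dz
            gc += err
            for k in range(d):
                ga[k] += err * p[k] * (row[k] - num)  # chain rule through softmax(a)
                gb[k] -= err * q[k] * (row[k] - den)  # chain rule through softmax(b)
        a = [ak - lr * g for ak, g in zip(a, ga)]
        b = [bk - lr * g for bk, g in zip(b, gb)]
        c -= lr * gc
    # Discretize the relaxation: keep the top-weighted variable on each side.
    num_idx = max(range(d), key=lambda k: a[k])
    den_idx = max(range(d), key=lambda k: b[k])
    return num_idx, den_idx


# Synthetic example: the label depends only on the ratio x[0] / x[1].
random.seed(0)
X = [[random.lognormvariate(0.0, 1.0) for _ in range(5)] for _ in range(200)]
y = [1 if row[0] > row[1] else 0 for row in X]
num_idx, den_idx = fit_soft_log_ratio(X, y)
print("selected log-ratio: log(x[%d] / x[%d])" % (num_idx, den_idx))
```

Because the numerator and denominator weights each sum to one, the relaxed score is unchanged when a sample is multiplied by a constant, which is the property that makes log-ratios suitable for compositional data. An exhaustive search over pairs would instead need to evaluate on the order of d² candidates, and far more for ratios between groups of variables.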

Updated: 2021-09-08