Large multiple sequence alignments with a root-to-leaf regressive method.
Nature Biotechnology (IF 33.1), Pub Date: 2019-12-02, DOI: 10.1038/s41587-019-0333-6
Edgar Garriga 1, Paolo Di Tommaso 1, Cedrik Magis 1, Ionas Erb 1, Leila Mansouri 1, Athanasios Baltzis 1, Hafid Laayouni 2,3, Fyodor Kondrashov 4, Evan Floden 1, Cedric Notredame 1,5

Multiple sequence alignments (MSAs) are used for structural [1,2] and evolutionary predictions [1,2], but the complexity of aligning large datasets requires the use of approximate solutions [3], including the progressive algorithm [4]. Progressive MSA methods start by aligning the most similar sequences and subsequently incorporate the remaining sequences, from leaf to root, based on a guide tree. Their accuracy declines substantially as the number of sequences is scaled up [5]. We introduce a regressive algorithm that enables MSA of up to 1.4 million sequences on a standard workstation and substantially improves accuracy on datasets larger than 10,000 sequences. Our regressive algorithm works the other way around from the progressive algorithm and begins by aligning the most dissimilar sequences. It uses an efficient divide-and-conquer strategy to run third-party alignment methods in linear time, regardless of their original complexity. Our approach will enable analyses of extremely large genomic datasets such as the recently announced Earth BioGenome Project, which comprises 1.5 million eukaryotic genomes [6].
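
The contrast in direction described above can be illustrated with a toy guide tree. This is a minimal sketch of traversal order only, not the authors' implementation and not an aligner: it assumes a strictly binary guide tree, and the names `Node`, `progressive_order`, `regressive_order`, and the four-leaf example tree are all hypothetical. It shows only that a progressive method visits internal nodes from the leaves toward the root, whereas a regressive method starts at the root.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Node of a toy binary guide tree (hypothetical); leaves stand for sequences."""
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def progressive_order(node: Node) -> List[str]:
    """Leaf-to-root: internal nodes in post-order (children before parent)."""
    if node.is_leaf():
        return []
    return progressive_order(node.left) + progressive_order(node.right) + [node.name]


def regressive_order(node: Node) -> List[str]:
    """Root-to-leaf: internal nodes in pre-order (parent before children)."""
    if node.is_leaf():
        return []
    return [node.name] + regressive_order(node.left) + regressive_order(node.right)


# Hypothetical guide tree ((A,B)n1,(C,D)n2)root; A..D are placeholder sequences.
tree = Node("root",
            Node("n1", Node("A"), Node("B")),
            Node("n2", Node("C"), Node("D")))

print(progressive_order(tree))  # ['n1', 'n2', 'root']
print(regressive_order(tree))   # ['root', 'n1', 'n2']
```

In the progressive order the root, which joins the most dissimilar groups, is handled last; in the regressive order it is handled first, matching the abstract's statement that the method begins with the most dissimilar sequences.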

Updated: 2019-12-02