APPLES: Scalable Distance-Based Phylogenetic Placement with or without Alignments
Systematic Biology (IF 6.1), Pub Date: 2019-09-23, DOI: 10.1093/sysbio/syz063
Metin Balaban, Shahab Sarmashghi, Siavash Mirarab

Placing a new species on an existing phylogeny has increasing relevance to several applications. Placement can be used to update phylogenies in a scalable fashion and can help identify unknown query samples using (meta-)barcoding, skimming, or metagenomic data. Maximum likelihood (ML) methods of phylogenetic placement exist, but these methods do not scale to reference trees with many thousands of leaves, which limits their ability to benefit from the dense taxon sampling of modern reference libraries. They also rely on assembled sequences for the reference set and aligned sequences for the query. Thus, ML methods cannot analyze datasets where the reference consists of unassembled reads, a scenario relevant to emerging applications of genome skimming for sample identification. We introduce APPLES, a distance-based method for phylogenetic placement. Compared to ML, APPLES is an order of magnitude faster and more memory efficient, and unlike ML, it is able to place on large backbone trees (tested for up to 200,000 leaves). We show that using dense references improves accuracy substantially, so that APPLES on dense trees is more accurate than ML on the sparser trees on which ML can run. Finally, APPLES can accurately identify samples without assembled references or aligned queries using k-mer-based distances, a scenario that ML cannot handle. APPLES is publicly available at github.com/balabanmetin/apples.
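
To make the idea of an alignment-free, k-mer-based distance concrete, here is a minimal Python sketch. It is not the APPLES implementation and does not use its API. It assumes a Mash-style distance derived from the Jaccard index of k-mer sets, which may differ from the distance estimator the authors use, and it picks the closest reference as a crude stand-in for placement on a tree. The function names, the toy sequences, and the choice of k = 4 are all hypothetical.

```python
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k):
    """Jaccard index of the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def kmer_distance(a, b, k=4):
    """Mash-style distance, D = -ln(2J / (1 + J)) / k.

    Assumption: this formula is an illustration only, not necessarily
    the distance used by APPLES.
    """
    j = jaccard(a, b, k)
    if j == 0.0:
        return float("inf")  # no shared k-mers: distance is undefined/large
    return -math.log(2 * j / (1 + j)) / k


# Hypothetical toy data: two unaligned references and one query.
references = {
    "A": "ACGTACGTACGTTTGA",
    "B": "TTTTGGGGCCCCAAAA",
}
query = "ACGTACGTACGTTTGC"

# Distances from the query to every reference, computed without alignment.
distances = {name: kmer_distance(query, seq) for name, seq in references.items()}

# Crude stand-in for placement: report the closest reference.
closest = min(distances, key=distances.get)
```

A distance-based placement method would go one step further and use the full vector of query-to-reference distances, together with the backbone tree, to choose a branch and position for the query rather than just the nearest leaf.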

Updated: 2019-09-23