

Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm
arXiv - CS - Computational Engineering, Finance, and Science. Pub Date: 2019-02-12, DOI: arxiv-1902.04341
Can Firtina, Jeremie S. Kim, Mohammed Alser, Damla Senol Cali, A. Ercument Cicek, Can Alkan, Onur Mutlu

Long reads produced by third-generation sequencing technologies are used to construct an assembly (i.e., the subject's genome), which is then used in downstream genome analysis. Unfortunately, long reads have high sequencing error rates, and a large proportion of the base pairs (bps) in these reads are incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing, i.e., fixing, errors in the assembly using information from alignments between reads and the assembly (i.e., read-to-assembly alignment information). However, existing assembly polishing algorithms can polish an assembly only with reads from a specific sequencing technology, or can polish only a small assembly. This technology dependency and assembly-size dependency force researchers to 1) run multiple polishing algorithms in order to use all available read sets and 2) split a large genome into small chunks in order to polish it. We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e., both large and small genomes) using reads from all sequencing technologies (i.e., second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo 1) models an assembly as a profile hidden Markov model (pHMM), 2) uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm, and 3) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real read sets demonstrate that Apollo is the only algorithm that 1) uses reads from any sequencing technology within a single run and 2) scales well to polish large assemblies without splitting the assembly into multiple parts.
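The two HMM algorithms named in the abstract can be illustrated on a toy model. The sketch below is not Apollo's implementation: Apollo builds a profile HMM from the assembly, whereas this example assumes a generic two-state HMM over nucleotides. Every name and probability in it (`GC_rich`, `AT_rich`, `READ`, the transition and emission tables) is hypothetical and chosen only for illustration. `forward_backward` computes the per-position state posteriors that Forward-Backward training would use to update parameters, and `viterbi` decodes the single most likely state path.

```python
import math

def forward_backward(obs, states, start_p, trans_p, emit_p):
    """Return, for each position, the posterior probability of each state."""
    n = len(obs)
    fwd = [{s: 0.0 for s in states} for _ in range(n)]
    bwd = [{s: 1.0 for s in states} for _ in range(n)]
    # Forward pass: probability of the prefix ending in state s.
    for s in states:
        fwd[0][s] = start_p[s] * emit_p[s][obs[0]]
    for t in range(1, n):
        for s in states:
            fwd[t][s] = emit_p[s][obs[t]] * sum(
                fwd[t - 1][r] * trans_p[r][s] for r in states)
    # Backward pass: probability of the suffix given state s at position t.
    for t in range(n - 2, -1, -1):
        for s in states:
            bwd[t][s] = sum(
                trans_p[s][r] * emit_p[r][obs[t + 1]] * bwd[t + 1][r]
                for r in states)
    total = sum(fwd[n - 1][s] for s in states)
    return [{s: fwd[t][s] * bwd[t][s] / total for s in states}
            for t in range(n)]

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state path (log space; assumes no zero probabilities)."""
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
              for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            # Best predecessor state for landing in s at position t.
            prev = max(states,
                       key=lambda r: score[t - 1][r] + math.log(trans_p[r][s]))
            score[t][s] = (score[t - 1][prev] + math.log(trans_p[prev][s])
                           + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Hypothetical toy model: two hidden states with different base preferences.
STATES = ("GC_rich", "AT_rich")
START = {"GC_rich": 0.5, "AT_rich": 0.5}
TRANS = {"GC_rich": {"GC_rich": 0.9, "AT_rich": 0.1},
         "AT_rich": {"GC_rich": 0.1, "AT_rich": 0.9}}
EMIT = {"GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
READ = "GGCATTA"  # made-up sequence, not real data

posteriors = forward_backward(READ, STATES, START, TRANS, EMIT)
path = viterbi(READ, STATES, START, TRANS, EMIT)
print(path)
```

In a profile HMM of the kind the abstract describes, the states would be tied to positions of the assembly rather than to base composition, but the forward, backward, and Viterbi recurrences take the same form.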


Updated: 2020-10-29
