Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm


Bioinformatics (IF 4.4), Pub Date: 2020-03-13, DOI: 10.1093/bioinformatics/btaa179
Can Firtina ₁, Jeremie S Kim ₁,₂, Mohammed Alser ₁, Damla Senol Cali ₂, A Ercument Cicek ₃, Can Alkan ₃, Onur Mutlu ₁,₂,₃


Motivation

Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs (bp). These long reads are used to construct an assembly (i.e., the subject's genome), which is then used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates, and a large proportion of the base pairs in these long reads are incorrectly identified. These errors propagate to the assembly and reduce the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by fixing errors in the assembly using information from alignments between reads and the assembly (i.e., read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from a certain sequencing technology, or can only polish a small assembly. This technology dependency and assembly-size dependency require researchers to 1) run multiple polishing algorithms to use all available read sets and 2) split a large genome into small chunks to polish it.

Results

We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e., both large and small genomes) using reads from all sequencing technologies (i.e., second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo 1) models an assembly as a profile hidden Markov model (pHMM), 2) uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm, and 3) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real read sets demonstrate that Apollo is the only algorithm that 1) uses reads from any sequencing technology within a single run and 2) scales well to polish large assemblies without splitting the assembly into multiple parts.
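The decoding step (step 3) can be illustrated with a generic Viterbi decoder. The sketch below is not Apollo's implementation and does not build a profile HMM: it is a minimal textbook Viterbi over a two-state toy model, assuming all probabilities are nonzero. The state names (`clean`, `noisy`), the observation symbols (`agree`, `mismatch`, `gap`) and every probability are hypothetical values chosen only for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its log-probability.

    Works in log space to avoid underflow on long sequences.
    Assumes every probability used is strictly positive.
    """
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    # back[t][s]: predecessor of s on that best path
    back = [{}]

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev

    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model: all names and numbers are illustrative assumptions.
STATES = ["clean", "noisy"]
START_P = {"clean": 0.6, "noisy": 0.4}
TRANS_P = {
    "clean": {"clean": 0.7, "noisy": 0.3},
    "noisy": {"clean": 0.4, "noisy": 0.6},
}
EMIT_P = {
    "clean": {"agree": 0.5, "mismatch": 0.4, "gap": 0.1},
    "noisy": {"agree": 0.1, "mismatch": 0.3, "gap": 0.6},
}

if __name__ == "__main__":
    path, log_p = viterbi(["agree", "mismatch", "gap"],
                          STATES, START_P, TRANS_P, EMIT_P)
    print(path, math.exp(log_p))
```

For the observation sequence `agree, mismatch, gap`, the decoder returns the path `clean, clean, noisy`. In a polishing setting the model would instead be trained from read-to-assembly alignments before decoding, which this toy example does not attempt.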

Supplementary information

Supplementary data are available at Bioinformatics online.

Availability

Source code is available at https://github.com/CMU-SAFARI/Apollo


Updated: 2020-03-16
