TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads.
GigaScience (IF 11.8), Pub Date: 2020-09-01, DOI: 10.1093/gigascience/giaa094
Mengyang Xu 1, 2, 3 , Lidong Guo 1, 4 , Shengqiang Gu 1, 4 , Ou Wang 3, 5 , Rui Zhang 1 , Brock A Peters 3, 6 , Guangyi Fan 1, 3 , Xin Liu 1, 2, 3, 7 , Xun Xu 3, 7 , Li Deng 1, 2, 3 , Yongwei Zhang 3, 6

BACKGROUND: Analyses that use genome assemblies are critically affected by the contiguity, completeness, and accuracy of those assemblies. In recent years, single-molecule sequencing techniques that generate long reads have become available and have enabled substantial improvements in contig length and genome completeness, especially for large genomes (>100 Mb), although bioinformatic tools for these applications are still limited.

FINDINGS: We developed TGS-GapCloser, a software tool that closes sequence gaps in genome assemblies using low-depth (∼10×) long single-molecule reads. The algorithm extracts reads that bridge the gap region between two contigs within a scaffold, error-corrects only these candidate reads, and assigns the best sequence to each gap. As a demonstration, we used TGS-GapCloser to improve the scaftig NG50 of 3 human genome assemblies by 24-fold on average with only ∼10× coverage of Oxford Nanopore or Pacific Biosciences reads, filling up to 94.8% of gaps with sequence data at a positive predictive value of 97.7%. Despite the high raw error rate of single-molecule reads, the improved assemblies achieve 99.998% (Q46) single-base accuracy, and the final inserted sequences have 99.97% (Q35) accuracy. This enables high-quality downstream analyses, including up to a 31-fold increase in the scaftig NGA50 and up to 13.1% more complete BUSCO genes. Additionally, we show that even in ultra-large genome assemblies, such as that of ginkgo (∼12 Gb), TGS-GapCloser can fill 71.6% of gaps with sequence data.

CONCLUSIONS: TGS-GapCloser can close gaps in large genome assemblies using raw long reads quickly and cost-effectively. The final assemblies generated by TGS-GapCloser have improved contiguity and completeness while maintaining high accuracy. The software is available at https://github.com/BGI-Qingdao/TGS-GapCloser.
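The abstract reports two kinds of metrics without defining them: Phred-scaled accuracy (Q46, Q35) and scaftig NG50. The sketch below illustrates both under their commonly used definitions, which are assumed here rather than taken from the paper: Q = -10·log10(error rate), a scaftig as a gap-free stretch of a scaffold obtained by splitting at runs of N, and NG50 as the length at which the longest pieces together reach half of an assumed genome size. The function names, the toy scaffold strings, and the genome size of 16 are all hypothetical and not part of TGS-GapCloser.

```python
import math
import re


def phred_from_accuracy(accuracy: float) -> float:
    """Phred-scaled quality: Q = -10 * log10(1 - accuracy)."""
    return -10.0 * math.log10(1.0 - accuracy)


def scaftigs(scaffold: str) -> list[str]:
    """Split a scaffold at runs of N (gaps) into gap-free scaftigs."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


def ng50(lengths: list[int], genome_size: int) -> int:
    """Length of the piece at which the cumulative sum of pieces,
    taken longest first, reaches half of genome_size.
    Returns 0 if the pieces never cover half the genome."""
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if 2 * total >= genome_size:
            return length
    return 0


# Accuracy values quoted in the abstract, converted to Phred scale.
q_assembly = phred_from_accuracy(0.99998)  # about 46.99
q_inserted = phred_from_accuracy(0.9997)   # about 35.23

# Toy scaffold (hypothetical data) before and after one gap is filled.
before = "ACGTNNNNACGTACNNGG"
after = "ACGTTTTTACGTACNNGG"
GENOME_SIZE = 16  # assumed genome size for the toy example

ng50_before = ng50([len(s) for s in scaftigs(before)], GENOME_SIZE)
ng50_after = ng50([len(s) for s in scaftigs(after)], GENOME_SIZE)

print(f"Q(assembly) = {q_assembly:.2f}, Q(inserted) = {q_inserted:.2f}")
print(f"scaftig NG50: {ng50_before} -> {ng50_after}")
```

Under this definition, the quoted Q46 and Q35 correspond to the computed values truncated to integers rather than rounded. The toy example also shows why filling a single gap can raise scaftig NG50 sharply: two scaftigs merge into one longer piece.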

Updated: 2020-09-01