RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads
BMC Bioinformatics ( IF 2.9 ) Pub Date : 2020-10-19 , DOI: 10.1186/s12859-020-03779-w
Xingyu Liao 1 , Xin Gao 2 , Xiankai Zhang 1 , Fang-Xiang Wu 3 , Jianxin Wang 1

Repetitive sequences account for a large proportion of eukaryotic genomes. Identifying them plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines and tools obtain repeats by assembling high-frequency k-mers. However, assemblers need a certain degree of sequence coverage to produce the desired assemblies. In addition, assemblers cut the reads into shorter k-mers for assembly, which may destroy the structure of the repetitive regions. For these reasons, it is difficult to obtain complete and accurate repetitive regions of a genome with existing tools. In this study, we present a new method called RepAHR for de novo repeat identification by assembly of the high-frequency reads. First, RepAHR scans next-generation sequencing (NGS) reads to find the high-frequency k-mers. Second, RepAHR selects the high-frequency reads from the full set of NGS reads according to rules based on the high-frequency k-mers. Finally, the high-frequency reads are assembled into repeats with SPAdes, which is regarded as an outstanding genome assembler for NGS data. We tested RepAHR on five data sets, and the experimental results show that RepAHR outperforms RepARK and REPdenovo at detecting repeats in terms of N50, reference alignment ratio, coverage ratio of reference, mask ratio of Repbase, and some other metrics.
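
The first two steps described above (finding high-frequency k-mers, then selecting high-frequency reads) can be illustrated with a minimal Python sketch. It is not the RepAHR implementation: the function names, the `min_count` threshold, and the `min_fraction` selection rule are all hypothetical stand-ins for the "certain rules" the abstract mentions, and reverse complements and base qualities are ignored for brevity. The selected reads would then be passed to an assembler such as SPAdes.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def high_frequency_kmers(counts, min_count):
    """Return k-mers seen at least min_count times (hypothetical threshold)."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


def filter_reads(reads, hf_kmers, k, min_fraction):
    """Keep reads whose share of high-frequency k-mers is >= min_fraction.

    This selection rule is an illustrative assumption, not RepAHR's rule.
    """
    kept = []
    for read in reads:
        total = len(read) - k + 1
        if total <= 0:
            continue  # read is shorter than k, so it contains no k-mers
        hits = sum(1 for i in range(total) if read[i:i + k] in hf_kmers)
        if hits / total >= min_fraction:
            kept.append(read)
    return kept


# Toy reads: three share a repeated ACGT motif, one is unrelated.
reads = [
    "ACGTACGTAC",
    "CGTACGTACG",
    "TTGACCAGTA",
    "ACGTACGTTT",
]
K = 4
counts = count_kmers(reads, K)
hf = high_frequency_kmers(counts, min_count=3)
selected = filter_reads(reads, hf, K, min_fraction=0.5)
print(selected)
```

On this toy input the unrelated read is dropped and the three reads sharing the repeated motif are kept. Because whole reads, not individual k-mers, go on to assembly, the context around each repeat is preserved, which is the motivation the abstract gives for assembling reads.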

Updated: 2020-10-19