Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art
Current Genomics ( IF 2.6 ) Pub Date : 2019-02-27 , DOI: 10.2174/1389202919666181026101326
Swati C Manekar 1 , Shailesh R Sathe 1

Background: In bioinformatics, estimating k-mer abundance histograms, or simply counting the number of unique k-mers and the number of singletons, is useful in many genome sequence analysis applications. These applications include predicting genome sizes, data pre-processing for de Bruijn graph assembly methods (tuning runtime parameters for analysis tools), repeat detection, sequencing coverage estimation, and measuring sequencing error rates. Several methods for cardinality estimation in sequencing data have been developed in recent years.
Objective: In this article, we present a comparative assessment of the k-mer frequency estimation programs ntCard, KmerGenie, KmerStream and Khmer (abundance-dist-single.py and unique-kmers.py) to assess their relative merits and demerits.
Methods: Principally, the miscounts/error rates of these tools are measured experimentally across a range of k values. We also present experimental results on runtime, scalability to larger datasets, memory, CPU utilization and parallelism of the k-mer frequency estimation methods.
Results: The results indicate that ntCard is more accurate than the other methods in estimating F0, f1 and full k-mer abundance histograms. ntCard is the fastest, but it requires more memory than KmerGenie.
Conclusion: The results of this evaluation may serve as a roadmap for potential users and practitioners of streaming algorithms for estimating k-mer coverage frequencies, helping them identify an appropriate method. Such an analysis also helps researchers discover remaining open research questions, effective combinations of existing techniques and possible avenues for future research.
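For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of what the assessed tools estimate, computed here exactly on a toy input. It is not the method of any of the tools: they use streaming or sampling estimators precisely to avoid storing every k-mer, as this sketch does. The function name `kmer_histogram` and the example reads are hypothetical, and the sketch assumes plain (non-canonical) k-mers, so a k-mer and its reverse complement are counted separately.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Exact k-mer abundance histogram: maps multiplicity i to f_i,
    the number of distinct k-mers that occur exactly i times."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    # Histogram over the multiplicities of the distinct k-mers.
    return Counter(counts.values())


# Hypothetical toy reads, for illustration only.
reads = ["ACGTACGT", "CGTACGTT"]
hist = kmer_histogram(reads, 3)

F0 = sum(hist.values())  # number of distinct k-mers
f1 = hist.get(1, 0)      # number of singletons (k-mers seen exactly once)
print(sorted(hist.items()), F0, f1)
```

On this input the distinct 3-mers are ACG, CGT, GTA, TAC and GTT, so F0 is 5, and only GTT occurs once, so f1 is 1. Exact counting like this needs memory proportional to the number of distinct k-mers, which is the cost the estimation programs are designed to avoid.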

Updated: 2019-02-27