An efficient classification algorithm for NGS data based on text similarity.
Genetics Research ( IF 1.5 ) Pub Date : 2018-09-18 , DOI: 10.1017/s0016672318000058
Xiangyu Liao 1 , Xingyu Liao 2 , Wufei Zhu 3 , Lu Fang 3 , Xing Chen 3

With the advancement of high-throughput sequencing technologies, the amount of available sequencing data is growing at a pace that has begun to seriously challenge the data processing and storage capacities of modern computer systems. Removing redundancy from such data by clustering could be crucial for reducing memory use, disk space and running time. Clustering can also reduce dataset noise in some analysis applications. In this study, we propose a high-performance short sequence classification algorithm (HSC) for next generation sequencing (NGS) data based on an efficient hash function and text similarity. First, HSC converts all reads into k-mers, then forms a unique k-mer set by merging duplicated and reverse-complementary elements. Second, all unique k-mers are stored in a hash table, where the k-mer string is stored in the key field and the IDs of the reads containing that k-mer are stored in the value field. Third, each hash unit is transformed into a short text consisting of reads. Fourth, texts that satisfy the similarity threshold are combined into a long text; the merge operation is repeated until no texts satisfy the merge condition. Finally, each long text is transformed into a cluster consisting of reads. We tested HSC using five real datasets. The experimental results showed that HSC can cluster 100 million short reads within 2 hours, and that it performs very well in reducing memory consumption. HSC is much faster than existing tools and can easily handle tens of millions of sequences. In addition, when HSC is used as a preprocessing tool to produce assembly data, the memory and time consumption of the assembler is greatly reduced, and the assembler can achieve better assemblies in terms of N50, NA50 and genome fraction.
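
The five steps in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the default k and threshold, and the use of Jaccard similarity between read-ID sets as the "text similarity" measure are all hypothetical, since the abstract does not specify them. The sketch also uses a naive quadratic merge loop, so it says nothing about the performance reported for HSC.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement,
    so both orientations collapse to one key (step 1)."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def build_kmer_table(reads, k):
    """Step 2: key = canonical k-mer, value = IDs of reads containing it."""
    table = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            table[canonical(seq[i:i + k])].add(read_id)
    return table


def jaccard(a, b):
    """Assumed similarity measure: overlap of two read-ID sets."""
    return len(a & b) / len(a | b)


def cluster_reads(reads, k=4, threshold=0.5):
    """Steps 3-5: treat each hash entry as a short 'text' of read IDs,
    then merge texts pairwise until no pair meets the threshold."""
    texts = [set(ids) for ids in build_kmer_table(reads, k).values()]
    merged = True
    while merged:
        merged = False
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                if jaccard(texts[i], texts[j]) >= threshold:
                    texts[i] |= texts[j]  # combine into a longer text
                    del texts[j]
                    merged = True
                    break
            if merged:
                break
    return texts  # each remaining set of read IDs is one cluster


if __name__ == "__main__":
    toy_reads = ["AAAAAA", "TTTTTT", "CCCCCC"]
    print(cluster_reads(toy_reads, k=4, threshold=0.5))
```

In this toy input the first two reads are reverse complements of each other, so they share a canonical k-mer and land in the same cluster, while the third read stays on its own. Note that in this simplified version a read may appear in more than one cluster; how HSC resolves such cases is not stated in the abstract.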

Updated: 2019-11-01