Transformation of FASTA files into feature vectors for unsupervised compression of short reads databases



Journal of Bioinformatics and Computational Biology (IF 1), Pub Date: 2020-11-08, DOI: 10.1142/s0219720020500481
Tao Tang, Jinyan Li


Biomedical studies usually generate FASTA data sets of short reads in the tens or hundreds. However, these data sets are currently compressed one by one, without considering the inter-similarity between data sets, which could otherwise be exploited to improve de novo compression. We show that clustering these data sets into similar sub-groups for group-by-group compression can greatly improve compression performance. Our novel idea is to detect the lexicographically smallest k-mer (k-minimizer) of every read in each data set, and to use these k-mers as features and their frequencies in each data set as feature values, transforming each of these huge data sets into a characteristic feature vector. Unsupervised clustering algorithms are then applied to these vectors to find similar data sets and merge them. Because a large number of common k-mers with similar feature values between two data sets implies a high proportion of overlapping reads shared between them, merging similar data sets creates immense sequence redundancy that boosts compression performance. Experiments confirm that our clustering approach can gain up to 12% improvement over several state-of-the-art algorithms in compressing reads databases consisting of 17–100 data sets (48.57–197.97 GB).
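The pipeline described in the abstract (one minimizer per read, minimizer frequencies as a feature vector, then grouping of similar data sets) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the value of k, the similarity measure, or the clustering algorithm, so the cosine similarity, the threshold-based single-linkage grouping, and all function names below are assumptions chosen for illustration. Reads are passed in as plain strings, and FASTA parsing is omitted.

```python
from collections import Counter
from math import sqrt
from typing import Dict, Iterable, List, Optional


def smallest_kmer(read: str, k: int) -> Optional[str]:
    """Return the lexicographically smallest k-mer of a read (None if the read is shorter than k)."""
    if len(read) < k:
        return None
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def feature_vector(reads: Iterable[str], k: int) -> Dict[str, float]:
    """Map each minimizer to its relative frequency among the reads of one data set."""
    counts = Counter()
    for read in reads:
        kmer = smallest_kmer(read, k)
        if kmer is not None:
            counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors (an assumed measure, not taken from the paper)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_similar(vectors: Dict[str, Dict[str, float]], threshold: float) -> List[List[str]]:
    """Single-linkage grouping (a stand-in for the unspecified clustering step):
    data sets whose similarity reaches the threshold end up in the same group."""
    names = list(vectors)
    parent = {name: name for name in names}

    def find(x: str) -> str:
        # Follow parent links to the group representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine_similarity(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())
```

Under these assumptions, each group returned by `group_similar` would be concatenated and passed to a compressor as one unit, so that reads shared between similar data sets become redundancy the compressor can exploit.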


Updated: 2020-11-08
