


Disk compression of k-mer sets
Algorithms for Molecular Biology (IF 1), Pub Date: 2021-06-21, DOI: 10.1186/s13015-021-00192-7
Amatur Rahman, Rayan Chikhi, Paul Medvedev


K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.
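To make the redundancy the abstract refers to concrete, here is a minimal Python sketch of a spectrum-preserving string set: a collection of strings whose k-mers are exactly the original k-mer set. It is an illustration only, not UST-Compress, ESS-Compress, or ESS-Tip-Compress. The function names (`kmers`, `spectrum`, `greedy_merge`) are hypothetical, the merging is a naive right-extension greedy, and reverse complements are ignored for simplicity (an assumption; it also assumes k >= 2 and the alphabet ACGT).

```python
def kmers(s, k):
    """Return the set of all length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def spectrum(strings, k):
    """Union of the k-mer sets of every string in the collection."""
    out = set()
    for s in strings:
        out |= kmers(s, k)
    return out


def greedy_merge(kmer_set, k):
    """Glue k-mers that overlap by k-1 characters into longer strings.

    Naive illustration (assumes k >= 2): each k-mer is used exactly once,
    so the result has the same spectrum as the input set.
    """
    unused = set(kmer_set)
    result = []
    while unused:
        s = min(unused)          # deterministic starting k-mer
        unused.remove(s)
        extended = True
        while extended:          # keep extending to the right while possible
            extended = False
            for c in "ACGT":
                nxt = s[-(k - 1):] + c
                if nxt in unused:
                    unused.remove(nxt)
                    s += c
                    extended = True
                    break
        result.append(s)
    return result


if __name__ == "__main__":
    k = 4
    kmer_set = kmers("ACGTACGGA", k)
    merged = greedy_merge(kmer_set, k)
    print("k-mers stored one by one:", k * len(kmer_set), "characters")
    print("merged strings:", merged, sum(len(s) for s in merged), "characters")
```

Storing n k-mers one by one costs n·k characters, whereas a single string covering all of them could cost as few as n + k - 1. That gap is the kind of redundancy a string set representation removes before a general-purpose compressor is applied.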


Updated: 2021-06-22
