Nongreedy Unbalanced Huffman Tree Compressor for Single and Multifasta Files.
Journal of Computational Biology (IF 1.4), Pub Date: 2020-06-05, DOI: 10.1089/cmb.2019.0249
Sultan Alyami, Chun-Hsi Huang

Next-generation sequencing technologies are producing genomic data at ever-increasing rates. Storing, transmitting, and processing this massive quantity of data has become a challenge, creating a vital need for a tool that compresses genomic data losslessly, thus reducing storage space and speeding up data transmission. Data centers typically adopt one of two general-purpose compressors for genomic data: gzip or bzip2. Both use Huffman encoding, although they implement it in different ways. However, neither takes advantage of the properties of DNA data, such as its small alphabet and many repeats. Huffman encoding compression can be improved by exploiting these DNA characteristics. Recently, it has been shown that creating an unbalanced Huffman tree (UHT) yields significantly better compression than the standard Huffman tree used in both gzip and bzip2. However, the UHT construction is greedy. This article proposes an improved nongreedy UHT (NUHT), a lossless, nonreference-based compressor for fasta and multifasta files. We compare our algorithm with two well-known general-purpose compressors, gzip and bzip2, as well as with UHT, a DNA-specific compressor based on a Huffman tree. Our algorithm outperforms all three in terms of compression ratio and is seven times faster than UHT.
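As background for the abstract, the following is a minimal sketch of standard Huffman coding applied to a short DNA string. It is not the UHT or NUHT construction, whose details the abstract does not give. The function names (`huffman_code`, `encode`, `decode`) and the sample sequence are illustrative assumptions, and the output is kept as a string of '0'/'1' characters rather than packed bytes for readability.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(text):
    """Build a standard Huffman code table {symbol: bitstring} for text."""
    freq = Counter(text)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(n, next(tie), {sym: ""}) for sym, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


def encode(text, table):
    """Concatenate the code word of every symbol in text."""
    return "".join(table[s] for s in text)


def decode(bits, table):
    """Invert encode(); works because Huffman codes are prefix-free."""
    inverse = {code: sym for sym, code in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


if __name__ == "__main__":
    seq = "AAAAAAAACCCCGGT"  # hypothetical, deliberately skewed sample
    table = huffman_code(seq)
    bits = encode(seq, table)
    print(table, len(bits), decode(bits, table) == seq)
```

For this skewed sample, the code assigns 1 bit to A, 2 bits to C, and 3 bits each to G and T, so the sequence takes 25 bits instead of the 30 bits needed by a fixed 2-bit-per-base encoding. On sequences where the four bases are roughly equally frequent, a standard Huffman code degenerates to 2 bits per base, which is one reason DNA-specific variants are of interest.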

Updated: 2020-06-05