LCQS: an efficient lossless compression tool of quality scores with random access functionality.
BMC Bioinformatics (IF 2.9), Pub Date: 2020-03-18, DOI: 10.1186/s12859-020-3428-7
Jiabing Fu, Bixin Ke, Shoubin Dong

Advanced sequencing machines have dramatically accelerated the generation of genomic data, making efficient compression of sequencing data an urgent need. Quality scores are the hardest part of the standard sequencing data format, FASTQ, to compress, and they have become the main obstacle in the development of FASTQ compressors. Existing lossless quality score compressors mainly rely on patterns specific to particular sequencers and on complex context modeling techniques to improve the compression ratio. Their main drawbacks are weak robustness, meaning that results on some sequencing files are unstable or cannot be produced at all, and slow compression speed. Some compressors build a fine-grained index structure to speed up random access decompression, but they do so at the cost of compression speed and large index files, which makes them inefficient and impractical. An efficient lossless quality score compressor that combines strong robustness, a high compression ratio, fast compression, and fast random access decompression is therefore needed. In this paper we propose LCQS, a lossless compression tool specialized for quality scores and built on the idea of maximizing the use of hardware resources. It consists of four sequential processing steps: partitioning, indexing, packing, and parallelizing. Experimental results show that LCQS outperforms all the other state-of-the-art compressors on all criteria except compression speed on the dataset SRR1284073. LCQS is also robust on all the test datasets, with compression speed-ups of up to 29.1x, file size reductions of up to 28.78%, and random access decompression speed-ups of up to 2.1x. In addition, LCQS scales well: the compression speed increases almost linearly as the size of the input dataset increases. Its ability to handle all kinds of quality scores, its advantages in compression ratio and compression speed, and its fast random access decompression make LCQS an efficient and advanced lossless quality score compressor. LCQS can be downloaded from https://github.com/SCUT-CCNL/LCQS and is freely available for non-commercial use.
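
The abstract names four steps (partitioning, indexing, packing, parallelizing) without detailing them. The general idea of block-wise compression with an index for random access can be sketched as follows. This is a minimal illustration under stated assumptions and not the LCQS implementation: it uses Python's general-purpose `zlib` as a stand-in compressor, and the function names `compress_blocks` and `read_line`, the fixed `block_size`, and the `(offset, length)` index layout are all hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(lines, block_size=4, workers=2):
    """Compress quality-score lines block by block.

    Returns (payload, index), where index[i] is the (offset, length)
    of compressed block i inside payload. zlib is only a stand-in for
    whatever compressor a real tool would use.
    """
    # Partitioning: group consecutive lines into fixed-size blocks.
    blocks = [
        "\n".join(lines[i:i + block_size]).encode("ascii")
        for i in range(0, len(lines), block_size)
    ]
    # Parallelizing: blocks are independent, so they can be compressed concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        compressed = list(pool.map(zlib.compress, blocks))
    # Indexing: record where each compressed block starts and how long it is.
    index, offset = [], 0
    for chunk in compressed:
        index.append((offset, len(chunk)))
        offset += len(chunk)
    # Packing: concatenate the compressed blocks into one payload.
    return b"".join(compressed), index


def read_line(payload, index, line_no, block_size=4):
    """Random access: decompress only the block that holds line_no."""
    offset, length = index[line_no // block_size]
    block = zlib.decompress(payload[offset:offset + length])
    return block.decode("ascii").split("\n")[line_no % block_size]


if __name__ == "__main__":
    quality_lines = ["IIII", "HHHH", "####", "FFFF", "GGGG"]
    payload, index = compress_blocks(quality_lines, block_size=2)
    print(read_line(payload, index, 3, block_size=2))
```

In a sketch like this, the block size sets the trade-off the abstract alludes to: smaller blocks make random access cheaper but enlarge the index and tend to lower the compression ratio.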

Updated: 2020-04-22