CIndex: compressed indexes for fast retrieval of FASTQ files
Bioinformatics (IF 5.8), Pub Date: 2021-09-10, DOI: 10.1093/bioinformatics/btab655
Hongwei Huo 1, Pengfei Liu 1, Chenhui Wang 1, Hongbo Jiang 1, Jeffrey Scott Vitter 2

Motivation: Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data. These data are generally stored in FASTQ format. Two important simultaneous goals are space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files.

Results: We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows–Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and tables REF and Rγ, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments conducted over real publicly available datasets from various sequencing instruments demonstrate that our proposed index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7–41.66 percentage points less space and provides speedups of 70–167.16 times, 1.44–35.57 times and 1.3–55.4 times, respectively. For extracting records in FASTQ files, our method uses 2.86–14.88 percentage points less space and provides a speedup of 3.13–20.1 times. CIndex has an additional advantage in that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice.

Availability and implementation: The software is available on GitHub at https://github.com/Hongweihuo-Lab/CIndex.

Supplementary information: Supplementary data are available at Bioinformatics online.
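
To illustrate what a count query over a Burrows–Wheeler transform looks like, here is a minimal Python sketch of textbook backward search. It is not the CIndex implementation: the abstract says CIndex answers such queries with wavelet trees, hybrid encoding and succinct structures, whereas this sketch uses a naively sorted suffix array and uncompressed occurrence tables. The names `build_index` and `count`, and the toy read, are assumptions made for illustration only.

```python
def build_index(text, sentinel="$"):
    """Build the tables needed for backward search (toy, uncompressed)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # Naive suffix array by sorting suffixes: fine for a toy read only.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT: the character preceding each sorted suffix (wraps to the sentinel).
    bwt = "".join(s[i - 1] for i in sa)
    # C[c] = number of characters in s that sort strictly before c.
    freq = {}
    for ch in s:
        freq[ch] = freq.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(freq):
        C[ch] = total
        total += freq[ch]
    # occ[c][i] = occurrences of c in bwt[:i]; a real compressed index
    # would answer this rank query from a succinct structure instead.
    occ = {ch: [0] * (len(bwt) + 1) for ch in freq}
    for i, ch in enumerate(bwt):
        for c, row in occ.items():
            row[i + 1] = row[i] + (1 if c == ch else 0)
    return C, occ, len(bwt)


def count(pattern, index):
    """Return how many times pattern occurs in the indexed text."""
    C, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = build_index("ACGTACGTTACG")  # hypothetical toy read
    for pattern in ("ACG", "CGT", "TT", "GG"):
        print(pattern, count(pattern, index))
```

Each step of the loop narrows the suffix-array range using two rank lookups, so the query cost depends on the pattern length rather than on the text length. That property is what makes compressed rank structures such as wavelet trees a natural fit for this kind of index.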

Updated: 2021-09-10