Impact of lossy compression of nanopore raw signal data on basecalling and consensus accuracy
bioRxiv - Bioinformatics, Pub Date: 2020-10-14, DOI: 10.1101/2020.04.19.049262
Shubham Chandak, Kedar Tatwawadi, Srivatsan Sridhar, Tsachy Weissman

Motivation: Nanopore sequencing provides a real-time and portable solution to genomic sequencing, enabling better assembly, structural variant discovery and modified base detection than second generation technologies. The sequencing process generates a huge amount of data in the form of raw signal contained in fast5 files, which must be compressed to enable efficient storage and transfer. Since the raw data is inherently noisy, lossy compression has potential to significantly reduce space requirements without adversely impacting performance of downstream applications.

Results: We explore the use of lossy compression for nanopore raw data using two state-of-the-art lossy time-series compressors, and evaluate the tradeoff between compressed size and basecalling/consensus accuracy. We test several basecallers and consensus tools on a variety of datasets at varying depths of coverage, and conclude that lossy compression can provide 35-50% further reduction in compressed size of raw data over the state-of-the-art lossless compressor with negligible impact on basecalling accuracy (≲0.2% reduction) and consensus accuracy (≲0.002% reduction). In addition, we evaluate the impact of lossy compression on methylation calling accuracy and observe that this impact is minimal for similar reductions in compressed size, although further evaluation with improved benchmark datasets is required for reaching a definite conclusion. The results suggest the possibility of using lossy compression, potentially on the nanopore sequencing device itself, to achieve significant reductions in storage and transmission costs while preserving the accuracy of downstream applications.

Availability: The code is available at https://github.com/shubhamchandak94/lossy_compression_evaluation.
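
The tradeoff between compressed size and signal fidelity described in the Results can be illustrated with a small, self-contained sketch. It is not the authors' pipeline and does not use the compressors evaluated in the paper: as stated assumptions, it substitutes a simple uniform quantizer with a bounded maximum error, followed by zlib, and runs on a synthetic signal. The names `synthetic_signal`, `quantize`, `dequantize` and `compressed_size` are hypothetical helpers introduced only for this illustration.

```python
import random
import zlib
from array import array


def synthetic_signal(n, seed=0):
    """Toy stand-in for a raw signal: step-like levels plus Gaussian noise."""
    rng = random.Random(seed)
    signal, level = [], 500
    for i in range(n):
        if i % 10 == 0:  # jump to a new level every 10 samples
            level = rng.randint(400, 600)
        signal.append(level + round(rng.gauss(0, 8)))
    return signal


def quantize(signal, max_error):
    """Map integer samples to bin indices with bin width 2*max_error + 1."""
    step = 2 * max_error + 1
    return [(x + max_error) // step for x in signal]


def dequantize(indices, max_error):
    """Reconstruct samples; each differs from the original by <= max_error."""
    step = 2 * max_error + 1
    return [q * step for q in indices]


def compressed_size(values):
    """Bytes after packing as 16-bit signed integers and applying zlib."""
    return len(zlib.compress(array("h", values).tobytes(), 9))


signal = synthetic_signal(50_000)
lossless_size = compressed_size(signal)
for max_error in (1, 2, 4, 8):
    indices = quantize(signal, max_error)
    restored = dequantize(indices, max_error)
    worst = max(abs(a - b) for a, b in zip(signal, restored))
    ratio = compressed_size(indices) / lossless_size
    print(f"max_error={max_error}: worst error={worst}, "
          f"size relative to lossless={ratio:.2f}")
```

Raising `max_error` shrinks the compressed output while increasing distortion. The paper's question is how far this can be pushed before basecalling and consensus accuracy degrade, which this sketch does not measure.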

Updated: 2020-10-15