Efficient storage and analysis of quantitative genomics data with the Dense Depth Data Dump (D4) format and d4tools
bioRxiv - Genomics. Pub Date: 2020-10-26, DOI: 10.1101/2020.10.23.352567
Hao Hou, Brent Pedersen, Aaron Quinlan

Modern DNA sequencing is used as a readout for diverse assays, with the count of aligned sequences, or "read depth", serving as the quantitative signal for many underlying cellular phenomena. Despite wide use and thousands of datasets, existing formats used for the storage and analysis of read depths are limited with respect to both file size and analysis speed. For example, it is faster to recalculate sequencing depth from an alignment file than it is to analyze the text output from that calculation. We sought to improve on existing formats such as BigWig and compressed BED files by creating the Dense Depth Data Dump (D4) format and tool suite. The D4 format is adaptive in that it profiles a random sample of aligned sequence depth from the input BAM or CRAM file to determine an optimal encoding that often affords reductions in file size, while also enabling fast data access. We show that D4 uses less storage for both RNA-Seq and whole-genome sequencing and offers 3- to 440-fold speed improvements over existing formats for random access, aggregation and summarization. This performance enables scalable downstream analyses that would otherwise be difficult. The D4 tool suite (d4tools) is freely available under an MIT license at: https://github.com/38/d4-format.
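
The adaptive step described above, profiling a random sample of depths to pick an encoding, can be illustrated with a small sketch. The abstract does not specify the encoding or the cost model, so everything below is an assumption made for illustration: the function name `choose_bit_width`, the idea of a fixed-width table of k bits per position plus a sparse side table for values that do not fit, and the 64-bit cost per outlier are all hypothetical and are not taken from d4tools.

```python
import random


def choose_bit_width(depths, sample_size=1000, max_bits=16,
                     outlier_cost_bits=64, seed=0):
    """Pick the bit width k that minimizes estimated bits per position.

    Assumed cost model (hypothetical, for illustration only): every position
    costs k bits in a fixed-width dense table, and every sampled depth that
    does not fit in k bits costs an extra `outlier_cost_bits` in a sparse
    side table.
    """
    if not depths:
        raise ValueError("depths must not be empty")

    # Profile a random sample rather than the whole depth vector.
    rng = random.Random(seed)
    if len(depths) <= sample_size:
        sample = list(depths)
    else:
        sample = rng.sample(depths, sample_size)

    best_k, best_cost = 0, float("inf")
    for k in range(max_bits + 1):
        capacity = 1 << k  # depths 0 .. capacity - 1 fit in k bits
        outliers = sum(1 for d in sample if d >= capacity)
        cost = k + outlier_cost_bits * outliers / len(sample)
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost


if __name__ == "__main__":
    # Toy input: 99% of positions at depth 30, 1% at depth 500.
    toy_depths = [30] * 990 + [500] * 10
    k, bits_per_position = choose_bit_width(toy_depths)
    print(k, bits_per_position)
```

On the toy input, where 99% of positions have depth 30 and 1% have depth 500, this cost model selects k = 5: five bits cover depths up to 31, and paying the side-table cost for the rare high-depth positions is cheaper than widening every entry to nine bits.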

Updated: 2020-10-30