Lossless indexing with counting de Bruijn graphs, Genome Research


Genome Research (IF 6.2), Pub Date: 2022-09-01, DOI: 10.1101/gr.276607.122
Mikhail Karasikov _{1,2,3}, Harun Mustafa _{1,2,3}, Gunnar Rätsch _{1,2,3,4,5}, André Kahles _{1,2,3}


Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node–label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in over eightfold smaller representations compared with state-of-the-art bioinformatics tools and are faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs, and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.
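
To make the central notion concrete, the following is a minimal, uncompressed sketch of a de Bruijn graph whose node–label relations carry a count and a list of positions. It only illustrates the data model described in the abstract; it is not the authors' implementation and uses none of the compression techniques the paper relies on. The class name `ToyCountingDBG`, its methods, and the convention of one sequence per label are assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield (position, k-mer) pairs for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


class ToyCountingDBG:
    """Hypothetical toy model: k-mer (node) -> label -> attribute.

    Plain dictionaries are used for clarity only; nothing here is
    compressed, and one sequence per label is assumed.
    """

    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(int))       # k-mer abundance
        self.positions = defaultdict(lambda: defaultdict(list))   # k-mer coordinates

    def add(self, label, seq):
        """Index one sequence under the given label."""
        for pos, kmer in kmers(seq, self.k):
            self.counts[kmer][label] += 1
            self.positions[kmer][label].append(pos)

    def successors(self, kmer):
        """Outgoing de Bruijn edges: indexed k-mers overlapping by k-1 characters."""
        return [kmer[1:] + c for c in "ACGT" if kmer[1:] + c in self.counts]

    def query_counts(self, seq):
        """For each k-mer of the query, return its {label: count} mapping."""
        return [dict(self.counts.get(kmer, {})) for _, kmer in kmers(seq, self.k)]

    def reconstruct(self, label):
        """Rebuild the labeled sequence from its positional annotations.

        Sequences shorter than k contribute no k-mers and cannot be recovered.
        """
        placed = sorted(
            (pos, kmer)
            for kmer, by_label in self.positions.items()
            for pos in by_label.get(label, [])
        )
        if not placed:
            return ""
        return placed[0][1] + "".join(kmer[-1] for _, kmer in placed[1:])


graph = ToyCountingDBG(k=3)
graph.add("s1", "ACGTACG")
graph.add("s2", "CGTT")
print(graph.query_counts("ACGT"))
print(graph.reconstruct("s1"))
```

In this sketch, the count attribute answers abundance queries per label, while the positional attribute is what makes the index lossless: sorting the k-mers of a label by position and overlapping them recovers the original sequence.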


Updated: 2022-09-01
