Efficient Colored de Bruijn Graph for Indexing Reads. Journal of Computational Biology


Efficient Colored de Bruijn Graph for Indexing Reads.
Journal of Computational Biology (IF 1.4), Pub Date: 2023-04-28, DOI: 10.1089/cmb.2022.0259
Nozomi Hasegawa¹, Kana Shimizu¹,²


The colored de Bruijn graph is a variation of the de Bruijn graph that has recently been utilized for indexing sequencing reads. Although state-of-the-art methods have achieved small index sizes, they produce many read-incoherent paths that tend to cover the same regions in the source genome sequence. To solve this problem, we propose an accurate coloring method that reduces the generation of read-incoherent paths by assigning different colors to a single read depending on the position in the read. This reduces ambiguous coloring in cases where a node has two successors and both successors have the same color. To avoid having to memorize the order of the colors, we use a hash function to generate and reproduce the series of colors from the initial color, and then apply a Bloom filter to store the colors and reduce the index size. Experimental results on simulated and real data demonstrate that our method reduces the occurrence of read-incoherent paths from 149,556 to only 2 and from 5,596 to 0, respectively. Moreover, for the simulated data, the depths of coverage for the reconstructed reads are equal to those for the input reads, whereas the previous method decreases the depth of coverage at many positions in the source genome. Our method achieves high accuracy with construction time, peak memory size, and index size comparable to those of the previous method.
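The coloring idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `BloomFilter`, `next_color`, `index_read`, and `reconstruct`, the choice of BLAKE2b as the hash, the filter size, and the `"kmer:color"` key format are all assumptions made for illustration. Each k-mer of a read is stored together with a color derived by repeatedly hashing the read's initial color, so a k-mer that occurs twice in the same read gets two different colors.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative sizes, not tuned)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting BLAKE2b with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item)
        )


def next_color(color, num_colors=1 << 20):
    """Hypothetical color step: the next color is a hash of the current one."""
    digest = hashlib.blake2b(color.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % num_colors


def index_read(bloom, read, initial_color, k):
    """Store every (k-mer, position-dependent color) pair of one read."""
    color = initial_color
    for i in range(len(read) - k + 1):
        bloom.add(f"{read[i:i + k]}:{color}")
        color = next_color(color)


def reconstruct(bloom, first_kmer, initial_color, length):
    """Walk the graph from the first k-mer, regenerating colors by hashing."""
    if f"{first_kmer}:{initial_color}" not in bloom:
        return None
    read, kmer, color = first_kmer, first_kmer, initial_color
    while len(read) < length:
        color = next_color(color)
        candidates = [b for b in "ACGT" if f"{kmer[1:]}{b}:{color}" in bloom]
        if len(candidates) != 1:
            return None  # ambiguous or broken path
        read += candidates[0]
        kmer = kmer[1:] + candidates[0]
    return read


# The 3-mer "ACG" occurs twice in this read, followed by "T" and then by "A".
# With one color per read both successors would carry the same color.
read = "ACGTACGA"
bloom = BloomFilter()
index_read(bloom, read, initial_color=7, k=3)
rebuilt = reconstruct(bloom, "ACG", 7, len(read))
```

In this sketch only the first k-mer, the initial color, and the read length need to be kept per read, since every later color is recomputed from the previous one. A Bloom filter can return false positives, so a real index would have to account for spurious successors; the sketch simply gives up when a step is ambiguous.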


Updated: 2023-04-28
