Aligning distant sequences to graphs using long seed sketches
Genome Research (IF 7), Pub Date: 2023-07-01, DOI: 10.1101/gr.277659.123
Amir Joudaki 1, 2 , Alexandru Meterez 1 , Harun Mustafa 1, 2, 3 , Ragnar Groot Koerkamp 1 , André Kahles 2, 3, 4 , Gunnar Rätsch 1, 2, 3, 5

Sequence-to-graph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and show that it yields a better time-accuracy trade-off in settings with up to a [formula] mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a k-nearest-neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformatics applications. We show that our method scales to graphs with 1 billion nodes and has quasi-logarithmic query time for queries with an edit distance of [formula]. For such queries, longer sketch-based seeds yield a [formula] increase in recall compared with exact seeds. Our approach can be incorporated into other aligners, providing a novel direction for sequence-to-graph alignment.
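
The seeding idea in the abstract (sketch long substrings into vectors, then look up a query's sketch among its nearest neighbors) can be illustrated with a toy example. This is a minimal sketch under loud assumptions, not the authors' method: the abstract does not say which sketch or index the paper uses, so a normalized k-mer count vector stands in for the sketch, a brute-force scan stands in for the k-nearest-neighbor index, and a linear reference string stands in for the graph. The names `sketch`, `build_index`, `nearest`, and the parameter `K` are hypothetical.

```python
from collections import Counter
from itertools import product
import math

K = 3  # k-mer length for the toy sketch (an assumed value, not from the paper)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def sketch(seq):
    """Map a sequence to a unit-length k-mer count vector (stand-in sketch)."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    vec = [counts.get(kmer, 0) for kmer in KMERS]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid division by zero
    return [v / norm for v in vec]


def build_index(reference, seed_len, stride):
    """Sketch every `stride`-th window, i.e. only a subset of positions."""
    return [
        (pos, sketch(reference[pos:pos + seed_len]))
        for pos in range(0, len(reference) - seed_len + 1, stride)
    ]


def nearest(index, query, k=1):
    """Brute-force k-nearest-neighbor lookup by Euclidean distance.

    A real index would replace this linear scan with an approximate
    nearest-neighbor structure to obtain sub-linear query time.
    """
    q = sketch(query)
    ranked = sorted(index, key=lambda item: math.dist(q, item[1]))
    return [pos for pos, _ in ranked[:k]]
```

Because the lookup compares vectors rather than requiring an exact substring match, a query that differs from an indexed window by a substitution or a short indel can still rank that window first. For example, after `index = build_index(reference, seed_len=20, stride=20)`, calling `nearest(index, read)` returns candidate seed positions that a downstream aligner could then extend.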

Updated: 2023-07-01