Aligning distant sequences to graphs using long seed sketches
- Amir Joudaki1,2,5,
- Alexandru Meterez1,5,
- Harun Mustafa1,2,3,
- Ragnar Groot Koerkamp1,
- André Kahles1,2,3 and
- Gunnar Rätsch1,2,3,4
- 1Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland;
- 2University Hospital Zurich, Biomedical Informatics Research, Zurich 8091, Switzerland;
- 3Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland;
- 4ETH AI Center, 8092 Zurich, Switzerland
- 5These authors contributed equally to this work.
Abstract
Sequence-to-graph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and show that it yields a better time-accuracy trade-off in settings with up to a mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a k-nearest neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformatics applications. We show that our method scales to graphs with 1 billion nodes and has quasi-logarithmic query time for queries with an edit distance of . For such queries, longer sketch-based seeds yield a increase in recall compared with exact seeds. Our approach can be incorporated into other aligners, providing a novel direction for sequence-to-graph alignment.
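The idea of seeding with long inexact matches in vector space can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the sketch function or the index, so the example assumes a plain k-mer count vector as a stand-in sketch and a brute-force search in place of a real k-nearest neighbor index. The names `kmer_profile`, `nearest`, and the seed labels are hypothetical.

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(seq, k=2, alphabet="ACGT"):
    """Embed a sequence as a vector of k-mer counts (a stand-in sketch, assumed here)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(query_vec, index, n=1):
    """Brute-force nearest-neighbor lookup; a real index would avoid the full scan."""
    return sorted(index, key=lambda item: l2(query_vec, item[1]))[:n]


# Hypothetical long seeds, standing in for sequences spelled by a subset of graph nodes.
seeds = {
    "node_a": "ACGTACGTTAGCCGATAGGCTTAACGGA",
    "node_b": "TTTTGGGGCCCCAAAATTTTGGGGCCCC",
    "node_c": "GATTACAGATTACAGATTACAGATTACA",
}
index = [(name, kmer_profile(seq)) for name, seq in seeds.items()]

# node_a with one substitution, one deletion, and one insertion.
query = "ACGTATGTTAGCGATAGGCATTAACGGA"

exact_hit = query in seeds.values()           # full-length exact matching fails
hits = nearest(kmer_profile(query), index)    # the sketch still ranks node_a first
print(exact_hit, hits[0][0])
```

In this toy setting the mutated query no longer matches any seed exactly, yet its vector stays closest to the seed it was derived from, which is the property that makes long inexact seeds usable.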
Footnotes
[Supplemental material is available for this article.]
Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.277659.123.
- Received January 5, 2023.
- Accepted April 16, 2023.
This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.