Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an average-case analysis of the seed-chain-extend heuristic
Genome Research (IF 7), Pub Date: 2023-07-01, DOI: 10.1101/gr.277637.122
Jim Shaw, Yun William Yu

Seed-chain-extend with k-mer seeds is a powerful heuristic technique for sequence alignment used by modern sequence aligners. Although effective in practice for both runtime and accuracy, theoretical guarantees on the resulting alignment do not exist for seed-chain-extend. In this work, we give the first rigorous bounds for the efficacy of seed-chain-extend with k-mers in expectation. Assume we are given a random nucleotide sequence of length ∼n that is indexed (or seeded) and a mutated substring of length ∼m ≤ n with mutation rate θ < 0.206. We prove that we can find a k = Θ(log n) for the k-mer size such that the expected runtime of seed-chain-extend under optimal linear-gap cost chaining and quadratic time gap extension is O(m n^{f(θ)} log n), where f(θ) < 2.43 · θ holds as a loose bound. The alignment also turns out to be good; we prove that more than a [Formula] fraction of the homologous bases is recoverable under an optimal chain. We also show that our bounds work when k-mers are sketched, that is, when only a subset of all k-mers is selected, and that sketching reduces chaining time without increasing alignment time or decreasing accuracy too much, justifying the effectiveness of sketching as a practical speedup in sequence alignment. We verify our results in simulation and on real noisy long-read data and show that our theoretical runtimes can predict real runtimes accurately. We conjecture that our bounds can be improved further and, in particular, that f(θ) can be further reduced.
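To make the seed and chain steps named in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`kmer_index`, `find_anchors`, `chain`), the `gap_cost` parameter, and the scoring rule (k per anchor, minus a penalty linear in the number of skipped bases) are all hypothetical. It uses a simple quadratic-time dynamic program rather than an optimized chaining algorithm, forbids overlapping anchors for simplicity, and omits the extend step (aligning the gaps between chained anchors).

```python
from collections import defaultdict


def kmer_index(ref, k):
    """Map each k-mer of ref to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def find_anchors(ref, query, k):
    """Seed: return sorted (ref_pos, query_pos) pairs of exact k-mer matches."""
    index = kmer_index(ref, k)
    anchors = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            anchors.append((i, j))
    anchors.sort()
    return anchors


def chain(anchors, k, gap_cost=1.0):
    """Chain: pick the best-scoring colinear subset of anchors.

    Assumed scoring (illustrative): each anchor contributes k, and each
    link pays gap_cost times the bases skipped in ref plus query.
    Runs in O(N^2) time for N anchors.
    """
    if not anchors:
        return []
    score = [float(k)] * len(anchors)
    prev = [-1] * len(anchors)
    for b, (ib, jb) in enumerate(anchors):
        for a in range(b):
            ia, ja = anchors[a]
            # Simplification: anchor a must end before anchor b starts
            # in both the reference and the query.
            if ia + k <= ib and ja + k <= jb:
                gap = (ib - ia - k) + (jb - ja - k)
                candidate = score[a] + k - gap_cost * gap
                if candidate > score[b]:
                    score[b] = candidate
                    prev[b] = a
    # Trace back from the highest-scoring anchor.
    best = max(range(len(anchors)), key=score.__getitem__)
    result = []
    while best != -1:
        result.append(anchors[best])
        best = prev[best]
    return result[::-1]
```

For example, calling `find_anchors(ref, query, k)` and passing the result to `chain(anchors, k)` returns a list of colinear k-mer matches; anchors that fall far off the main diagonal tend to be excluded because linking to them incurs a large gap penalty.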

Updated: 2023-07-01