Sequential and parallel algorithms for all-pair k-mismatch maximal common substrings, Journal of Parallel and Distributed Computing


Journal of Parallel and Distributed Computing (IF 3.4), Pub Date: 2020-06-04, DOI: 10.1016/j.jpdc.2020.05.018
Sriram P. Chockalingam, Sharma V. Thankachan, Srinivas Aluru

Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used construct in computational biology, with applications in DNA sequence clustering and assembly. Due to errors made by sequencers, algorithms that can accommodate a small number of differences are of particular interest. Formally, let $D$ be a collection of $n$ sequences of total length $N$, $\phi$ be a length threshold, and $k$ be a mismatch threshold. The goal is to identify and report all $k$-mismatch maximal common substrings of length at least $\phi$ over all pairs of strings in $D$. Heuristics based on seed-and-extend style filtering techniques are often employed in such applications. However, such methods cannot provide any provably efficient run time guarantees. To this end, we present a sequential algorithm with an expected run time of $O(N \log^{k} N + occ)$, where $occ$ is the output size. We then present a distributed memory parallel algorithm with an expected run time of $O\left(\left(\frac{N}{p} \log N + occ\right) \log^{k} N\right)$ using $O(\log^{k+1} N)$ expected rounds of global communications, under some realistic assumptions, where $p$ is the number of processors. Finally, we demonstrate the performance and scalability of our algorithms using experiments on large high throughput sequencing data.
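
To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the algorithm from the paper: it scans every diagonal of every pair of strings, so it takes time quadratic in the string lengths rather than the $O(N \log^{k} N + occ)$ expected time stated in the abstract. The function names and the `(i, j, length)` output format are hypothetical. The sketch also assumes one reading of "maximal": an aligned pair of equal-length substrings with at most $k$ mismatches that cannot be extended by one character on either side (in both strings at once) without exceeding $k$ mismatches or running past the end of a string.

```python
from itertools import combinations


def kmismatch_maximal_common_substrings(x, y, k, phi):
    """Brute-force baseline (assumed definition, not the paper's method).

    Returns (i, j, length) triples such that x[i:i+length] and
    y[j:j+length] differ in at most k positions, length >= phi, and the
    pair cannot be extended left or right while staying within k mismatches.
    """
    results = []
    # Each diagonal fixes the offset d = i - j between positions in x and y.
    for d in range(-(len(y) - 1), len(x)):
        i0, j0 = max(d, 0), max(-d, 0)
        diag_len = min(len(x) - i0, len(y) - j0)
        # Positions along this diagonal where the two strings disagree.
        mism = [t for t in range(diag_len) if x[i0 + t] != y[j0 + t]]
        if len(mism) <= k:
            # The whole diagonal fits within the mismatch budget.
            windows = [(0, diag_len)]
        else:
            # Sentinels mark the ends of the diagonal. Each window holds
            # exactly k consecutive mismatches and is bounded on both
            # sides by a further mismatch or by the end of the diagonal.
            bounds = [-1] + mism + [diag_len]
            windows = [
                (bounds[a] + 1, bounds[a + k + 1])
                for a in range(len(mism) - k + 1)
            ]
        for start, end in windows:
            if end - start >= phi:
                results.append((i0 + start, j0 + start, end - start))
    return results


def all_pairs(D, k, phi):
    """Apply the baseline to every unordered pair of strings in D."""
    out = []
    for (a, x), (b, y) in combinations(enumerate(D), 2):
        for i, j, length in kmismatch_maximal_common_substrings(x, y, k, phi):
            out.append((a, b, i, j, length))
    return out


if __name__ == "__main__":
    for hit in all_pairs(["ACGTACGT", "ACGAACGT"], k=1, phi=4):
        print(hit)
```

Working diagonal by diagonal keeps the maximality check simple: once the mismatch positions on a diagonal are known, every maximal window is delimited by the mismatch just before it and the mismatch just after it. A baseline like this is mainly useful for checking the output of a faster implementation on small inputs.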


Updated: 2020-06-04
