Journal of Parallel and Distributed Computing (IF 3.4), Pub Date: 2020-06-04, DOI: 10.1016/j.jpdc.2020.05.018
Sriram P. Chockalingam, Sharma V. Thankachan, Srinivas Aluru
Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used primitive in computational biology, with applications in DNA sequence clustering and assembly. Because sequencers make errors, algorithms that can tolerate a small number of differences are of particular interest. Formally, the input is a collection of sequences, characterized by its total length, together with a length threshold and a mismatch threshold k. The goal is to identify and report all k-mismatch maximal common substrings whose length is at least the length threshold, over all pairs of strings in the collection. Heuristics based on seed-and-extend filtering are often employed in such applications, but they offer no provable run-time guarantees. To this end, we present a sequential algorithm with a provable bound on its expected run time that depends on the output size. We then present a distributed-memory parallel algorithm with bounds on both its expected run time and its expected number of rounds of global communication, under some realistic assumptions, where the bounds depend on the number of processors. Finally, we demonstrate the performance and scalability of our algorithms using experiments on large high-throughput sequencing data.
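The problem statement can be made concrete with a brute-force reference in Python. This is only a sketch of what is being reported, not the algorithm of the paper: it compares every pair of sequences diagonal by diagonal and therefore takes time proportional to the product of the two lengths for each pair. The names `k_mismatch_mcs`, `diagonal_segments`, `phi` (length threshold), `k` (mismatch threshold), and the output tuple layout are assumptions made for illustration. "Maximal" is taken here to mean that the aligned pair of substrings cannot be extended to the left or right without exceeding `k` mismatches or running off the end of a sequence.

```python
from itertools import combinations


def diagonal_segments(x, y, i0, j0, k):
    """Yield (offset, length) of every maximal run with at most k mismatches
    along the diagonal that starts at x[i0] aligned with y[j0]."""
    diag_len = min(len(x) - i0, len(y) - j0)
    # Mismatch offsets on this diagonal, with sentinels just outside both ends.
    mism = [-1] + [t for t in range(diag_len) if x[i0 + t] != y[j0 + t]] + [diag_len]
    r = len(mism) - 2  # number of real mismatches on the diagonal
    if r <= k:
        # The whole diagonal fits within the mismatch budget.
        yield 0, diag_len
        return
    # Each maximal run starts right after one mismatch (or the left end) and
    # stops right before the mismatch that would be number k + 1 in the run.
    for t in range(r - k + 1):
        start = mism[t] + 1
        end = mism[t + k + 1]  # exclusive
        yield start, end - start


def k_mismatch_mcs(seqs, phi, k):
    """Return sorted tuples (a, i, b, j, length): seqs[a][i:i+length] and
    seqs[b][j:j+length] differ in at most k positions, the pair is maximal,
    and length >= phi. Brute force, for illustration only."""
    out = []
    for (a, x), (b, y) in combinations(enumerate(seqs), 2):
        # Every diagonal starts either in the first column or the first row.
        starts = [(i, 0) for i in range(len(x))] + [(0, j) for j in range(1, len(y))]
        for i0, j0 in starts:
            for off, length in diagonal_segments(x, y, i0, j0, k):
                if length >= phi:
                    out.append((a, i0 + off, b, j0 + off, length))
    return sorted(out)


if __name__ == "__main__":
    print(k_mismatch_mcs(["AAAA", "AATA"], phi=4, k=1))  # [(0, 0, 1, 0, 4)]
```

With `k=0` the same function reports ordinary (exact) maximal common substrings, which is a convenient sanity check. A seed-and-extend heuristic would typically find exact seeds first and then extend them, whereas this reference enumerates every diagonal exhaustively.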
Title (translated from Chinese): Sequential and parallel algorithms for all-pair k-mismatch maximal common substrings