Sequential and parallel algorithms for all-pair k-mismatch maximal common substrings

doi:10.1016/j.jpdc.2020.05.018

Journal of Parallel and Distributed Computing

Volume 144, October 2020, Pages 68-79

https://doi.org/10.1016/j.jpdc.2020.05.018 Get rights and content

Highlights

•
First and only known sub-quadratic algorithms for the k-mismatch maximal common substring problem.
•
Sequential algorithm runs in $O (N {log}^{k} N + occ)$ expected time, where occ is the output size.
•
Our distributed memory parallel algorithm runs in $O (((\frac{N}{p}) log N + occ) {log}^{k} N)$ expected time using $O ({log}^{k + 1} N)$ expected communication rounds, where p is the number of processors.
•
Parallel algorithm is demonstrated with high throughput genomic datasets ranging in size from 18 million to over 270 million reads, on up to 1024 processor cores.

Abstract

Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used construct in computational biology, with applications in DNA sequence clustering and assembly. Due to errors made by sequencers, algorithms that can accommodate a small number of differences are of particular interest. Formally, let $D$ be a collection of $n$ sequences of total length $N$ , $ϕ$ be a length threshold, and $k$ be a mismatch threshold. The goal is to identify and report all $k$ -mismatch maximal common substrings of length at least $ϕ$ over all pairs of strings in $D$ . Heuristics based on seed-and-extend style filtering techniques are often employed in such applications. However, such methods cannot provide any provably efficient run time guarantees. To this end, we present a sequential algorithm with an expected run time of $O (N {log}^{k} N + occ)$ , where $occ$ is the output size. We then present a distributed memory parallel algorithm with an expected run time of $O ((\frac{N}{p} log N + occ) {log}^{k} N)$ using $O ({log}^{k + 1} N)$ expected rounds of global communications, under some realistic assumptions, where $p$ is the number of processors. Finally, we demonstrate the performance and scalability of our algorithms using experiments on large high throughput sequencing data.

Introduction

Sequence matching algorithms are at the core of many applications in computational biology. Next Generation Sequencing (NGS) [15] instruments sequence hundreds of millions of short reads, that are typically randomly sampled from one or multiple genomes. Deciphering pairwise alignments between the reads is often the first step in many applications. For example, one may be interested in finding all pairs of reads that have a sufficiently long overlap, such as suffix/prefix overlap (for genomic or metagenomic assembly [20]), or substring overlap (for read compression [8], finding RNA sequences containing common exons [9], [17], etc.). Much of modern-day high-throughput sequencing is carried out using Illumina sequencers, which have a small error rate ( $<$ 1%–2%) and predominantly ( $> 99 %$ ) substitution errors [16]. Thus, algorithms that tolerate a small number of mismatch errors can yield the same solution as the much more expensive alignment computations. Motivated by such applications, we formulate the following all-pair $k$ -mismatch maximal common substrings problem:

Problem 1

Given a collection $D = {S_{1}, S_{2}, \dots, S_{n}}$ of $n$ sequences with $\sum_{i} | S_{i} | = N$ , a length threshold $ϕ$ , and a mismatch threshold $k \geq 0$ , report all $k$ -mismatch maximal common substrings of length $\geq ϕ$ between all pairs of sequences in $D$ .

A pair of two equal length substrings $S_{i} [x . . (x + t - 1)]$ and $S_{j} [y . . (y + t - 1)]$ is a $k$ -mismatch common substring if the hamming distance between them is $\leq k$ . Also, they are a $k$ -mismatch maximal common substring if neither $S_{i} [(x - 1) . . (x + t - 1)]$ and $S_{j} [(y - 1) . . (y + t - 1)]$ , nor $S_{i} [x . . (x + t)]$ and $S_{j} [y . . (y + t)]$ are a $k$ -mismatch common substring pair.

In this paper, we present efficient solutions for this problem in both sequential as well as parallel settings. Our sequential algorithm runs in $O (N {log}^{k} N + occ)$ expected time, where $occ$ is the output size. Our distributed memory parallel algorithm runs in $O (((N ∕ p) log N + occ) {log}^{k} N)$ expected time using $O ({log}^{k + 1} N)$ expected communication rounds, where $p$ is the number of processors. Here we make a reasonable assumption that the number of occurrence of any $τ$ -long substring across all sequences in $D$ is $O (N ∕ p)$ . Under this assumption, our algorithm enforces an effective partitioning of a series of modified suffix trees to localize processing within each processor. We demonstrate the scalability and performance of our parallel algorithm using genomic datasets ranging in size from 18 million to over 270 million reads, on up to 1024 processor cores.

To solve such problems in practice, seed-and-extend style filtering approaches are often employed. The underlying principle is: if two sequences have a $k$ -mismatch common substring of length $\geq ϕ$ , then they must have an exact common substring of length at least $τ = ⌈\frac{ϕ - k}{k + 1}⌉$ . Therefore, using a fast hashing technique, all pairs of sequences that have a $τ$ -length common substring are identified. Then, by exhaustively checking all such candidate pairs, the final output is generated. In the sequential setting, the filtering heuristics can be broadly classified under three categories: suffix filtering [12], [23], spaced seeds filtering [3], and substring filtering [21]. In case of parallel heuristic methods, filtering based methods mainly focused on the corresponding applications. For example, [11] and [19] proposed suffix tree based parallel clustering of EST data. Clearly, a filtering-based algorithm cannot provide any run time guarantees and often times the candidate pairs generated can be overwhelmingly larger than the final output size. Recently, [22] published the first and only known sub-quadratic exact sequential algorithm for this problem that includes insertions and deletions along with mismatches. Work done on accelerating pairwise edit distance estimations among $n$ sequences using sequence alignment can also be applied to this problem [18]. However, for a large number of short sequences (with low error rate, mostly mismatches), all pair sequence alignment is impractical because of its quadratic time complexity.

This paper is organized as follows. In Section 2, we introduce notations and data structures used in our algorithm. Due to the absence of a provably efficient sequential algorithm for this problem, we first design such an algorithm and present it in Section 3. In Section 4, we describe the parallel algorithm in detail and prove the claimed bounds on expected time and communication rounds. In Section 5, we discuss the implementation details of the parallel algorithm. Finally in Section 6, we discuss the results of our implementation on genomic and gene expression datasets.

Section snippets

Notation and preliminaries

Let $Σ$ denote the alphabet of the sequences in $D$ . Throughout this paper, both $| Σ |$ and $k$ are assumed to be constants. Let $T = S_{1} $_{1} S_{2} $_{2} \dots S_{n} $_{n}$ be the concatenation of all sequences in $D$ , separated by special characters $$_{1}, $_{2}, \dots, $_{n}$ . Here each $$_{i} \notin Σ$ is a unique special symbol and is lexicographically larger than all characters in $Σ$ . Clearly, there exists a one to one mapping between the positions in $T$ (except the $$_{i}$ positions) and the positions in the sequences in $D$ . Let $seq (x)$ denote the identifier of

The exact match case

As a warm up, we first show how to solve the exact match case ( $k = 0$ ) in optimal $O (N + occ)$ worst case time. First create the $GST$ , then identify all nodes $u$ in $GST$ such that $s t r i n g - d e p t h (u) \geq ϕ$ and $s t r i n g - d e p t h (parent (u)) < ϕ$ . Such nodes are termed as marked nodes. Clearly, a pair of suffixes satisfies condition (1) specified in Problem 2 iff their corresponding leaves are under the same marked node. This allows us to process the suffixes under each marked node $w$ independently as follows: let ${Suff}_{w}$

Our parallel algorithm

In this section, we show how to extend our ideas to obtain a provably efficient parallel algorithm. We assume that the input set of strings $D$ , or equivalently their concatenated string $T$ , is distributed across the $p$ processors such that each processor has the same number of total characters. Note that a maximal common substring with $k$ -mismatches is essentially a concatenation of $(k + 1)$ maximal exact matches separated by $k$ mismatch positions in between. Among the various ways the $k$ mismatches can

Implementation details

We implemented our algorithm using C++ and MPI. We use block-wise distribution for the distributed SA and LCP arrays. We refer to the elements located within a processor as its ‘local elements’ or ‘local array’.

Experimental results

We ran our experiments on an Intel Xeon Infiniband Cluster. Each node has a 2.7 GHz $24$ -core Intel Xeon $6226$ processor with $192$ GB of main memory and is running RHEL7.6 operating system. The nodes of the cluster are interconnected with EDR ( $100$ Gbps) InfiniBand. Experiments were conducted on up to 64 nodes using 16 cores per node, totaling 1,024 cores. We evaluated our algorithm on three different datasets, detailed in Table 1. Dataset D2 (NCBI SRA Accession Number SRR2984882) consists of 18.4

Conclusion

Approximate sequence matching algorithms are of significant interest in computational biology as replacement for quadratic alignment-based algorithms, particularly as high-throughput sequencers are producing large-scale datasets. In this paper, we presented a parallel algorithm for finding $k$ -mismatch, all-pair maximal common substrings between a large set of strings. While the only sub-quadratic sequential algorithm to solve this problem is the one recently proposed by [22], there is no

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research is supported in part by the U.S. National Science Foundation under IIS-1416259, CCF-1704552 and CCF-1703489.

Sriram P. Chockalingam is a research scientist in the Institute for Data Engineering and Science at the Georgia Institute of Technology, Atlanta, GA. He received his Ph.D. degree in Computer Science and Engineering from Indian Institute of Technology Bombay, India. His research interests include parallel algorithms, approximate sequence matching and systems biology.

References (24)

BerkmanO. et al.
Optimal doubly logarithmic parallel algorithms based on finding all nearest smaller values
J. Algorithms
(1993)
SarjeA. et al.
All-pairs computations on many-core graphics processors
Parallel Comput.
(2013)
ScheetzT.E. et al.
Gene transcript clustering: a comparison of parallel approaches
Future Gener. Comput. Syst.
(2005)
VälimäkiN. et al.
Approximate all-pairs suffix/prefix overlaps
Inform. and Comput.
(2012)
AdhiantoL. et al.
HPCToolkit: Tools for performance analysis of optimized parallel programs
Concurr. Comput.: Pract. Exper.
(2010)
BurkhardtS. et al.
Better filtering with gapped q-grams
Fund. inform.
(2003)
DevroyeL. et al.
A note on the height of suffix trees
SIAM J. Comput.
(1992)
Farach-ColtonM. et al.
On the sorting-complexity of suffix tree construction
J. ACM
(2000)
FischerJ. et al.
A new succinct representation of RMQ-information and improvements in the enhanced suffix array
FlickP. et al.
Parallel distributed memory construction of suffix and longest common prefix arrays

FritzM.H.-Y. et al.

Efficient storage of high throughput DNA sequencing data using reference-based compression

Genome Res.

(2011)

GrabherrM.G. et al.

Full-length transcriptome assembly from RNA-Seq data without a reference genome

Nature Biotechnol.

(2011)

Cited by (0)

Sharma V. Thankachan is an Assistant Professor in the Department of Computer Science at University of Central Florida, Orlando. He received his Ph.D. degree in Computer Science from Louisiana State University in 2014. Prior to that, received his B. Tech. degree in Electrical and Electronics Engineering from National Institute of Technology Calicut, India in 2006. His research interests include parallel algorithms, computational biology and succinct data structures.

Srinivas Aluru is the Executive Director of the Georgia Tech Interdisciplinary Research Institute (IRI) in Data Engineering and Science (IDEaS) and a professor in the School of Computational Science and Engineering within the College of Computing. He co-leads the NSF South Big Data Regional Innovation Hub which nurtures big data partnerships between organizations in the 16 Southern States and Washington D.C., and the NSF Transdisciplinary Research Institute for Advancing Data Science. Aluru conducts research in high performance computing, data science, bioinformatics and systems biology, combinatorial scientific computing, and applied algorithms. He pioneered the development of parallel methods in computational biology, and contributed to the assembly and analysis of complex plant genomes. His contributions in scientific computing lie in parallel Fast Multipole Method, domain decomposition methods, spatial data structures, and applications in computational electromagnetics and materials informatics. Aluru serves on the editorial boards of the IEEE Transactions on Big Data, ACM/IEEE Transactions on Computational Biology and Bioinformatics, the Journal of Parallel and Distributed Computing, and the International Journal of Data Mining and Bioinformatics. He is a Fellow of the American Association for the Advancement of Science (AAAS) and the Institute for Electrical and Electronic Engineers (IEEE).

View full text

Sequential and parallel algorithms for all-pair k-mismatch maximal common substrings

Highlights

Abstract

Introduction

Section snippets

Notation and preliminaries

The exact match case

Our parallel algorithm

Implementation details

Experimental results

Conclusion

Declaration of Competing Interest

Acknowledgment

J. Algorithms

Parallel Comput.

Future Gener. Comput. Syst.

Inform. and Comput.

HPCToolkit: Tools for performance analysis of optimized parallel programs

Concurr. Comput.: Pract. Exper.

Better filtering with gapped q-grams

Fund. inform.

A note on the height of suffix trees

SIAM J. Comput.

On the sorting-complexity of suffix tree construction

J. ACM

A new succinct representation of RMQ-information and improvements in the enhanced suffix array

Parallel distributed memory construction of suffix and longest common prefix arrays

Efficient storage of high throughput DNA sequencing data using reference-based compression

Genome Res.

Full-length transcriptome assembly from RNA-Seq data without a reference genome

Nature Biotechnol.

Sequential and parallel algorithms for all-pair $k$ -mismatch maximal common substrings