Sequential and parallel algorithms for all-pair k-mismatch maximal common substrings
Introduction
Sequence matching algorithms are at the core of many applications in computational biology. Next Generation Sequencing (NGS) [15] instruments sequence hundreds of millions of short reads that are typically randomly sampled from one or multiple genomes. Deciphering pairwise alignments between the reads is often the first step in many applications. For example, one may be interested in finding all pairs of reads that have a sufficiently long overlap, such as suffix/prefix overlap (for genomic or metagenomic assembly [20]), or substring overlap (for read compression [8], finding RNA sequences containing common exons [9], [17], etc.). Much of modern-day high-throughput sequencing is carried out using Illumina sequencers, which have a small error rate (1%–2%) and produce predominantly substitution errors [16]. Thus, algorithms that tolerate a small number of mismatch errors can yield the same solution as the much more expensive alignment computations. Motivated by such applications, we formulate the following all-pair k-mismatch maximal common substrings problem:
Problem 1. Given a collection of sequences, a length threshold, and a mismatch threshold k, report all k-mismatch maximal common substrings of length at least the length threshold between all pairs of sequences in the collection.
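A pair of equal-length substrings, one from each of two sequences, is a k-mismatch common substring if the Hamming distance between them is at most k. The pair is a k-mismatch maximal common substring if neither the pair obtained by extending both substrings by one character to the left, nor the pair obtained by extending both by one character to the right, is a k-mismatch common substring.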
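The definition above can be checked directly for a given pair of substrings. The following is a minimal sketch, not the algorithm of this paper: k denotes the mismatch threshold, the function names are hypothetical, and we assume that an extension running past the end of either sequence does not count as a common substring.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hamming distance between a[x..x+len-1] and b[y..y+len-1].
std::size_t hamming(const std::string& a, std::size_t x,
                    const std::string& b, std::size_t y, std::size_t len) {
    assert(x + len <= a.size() && y + len <= b.size());
    std::size_t d = 0;
    for (std::size_t i = 0; i < len; ++i) {
        if (a[x + i] != b[y + i]) ++d;
    }
    return d;
}

// True if the two substrings differ in at most k positions.
bool is_k_mismatch_common(const std::string& a, std::size_t x,
                          const std::string& b, std::size_t y,
                          std::size_t len, std::size_t k) {
    return hamming(a, x, b, y, len) <= k;
}

// True if the pair is a k-mismatch common substring and neither its
// one-character left extension nor its one-character right extension is.
// Assumption: an extension that leaves either sequence is not valid.
bool is_k_mismatch_maximal(const std::string& a, std::size_t x,
                           const std::string& b, std::size_t y,
                           std::size_t len, std::size_t k) {
    if (!is_k_mismatch_common(a, x, b, y, len, k)) return false;
    const bool left_extends =
        x > 0 && y > 0 &&
        is_k_mismatch_common(a, x - 1, b, y - 1, len + 1, k);
    const bool right_extends =
        x + len < a.size() && y + len < b.size() &&
        is_k_mismatch_common(a, x, b, y, len + 1, k);
    return !left_extends && !right_extends;
}
```

Each check costs time linear in the substring length, so applying it to every pair of positions is only useful as a reference for testing faster methods.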
In this paper, we present efficient solutions for this problem in both sequential and parallel settings. Our sequential algorithm runs in expected time, where is the output size. Our distributed-memory parallel algorithm runs in expected time using expected communication rounds, where is the number of processors. Here we make a reasonable assumption that the number of occurrences of any -long substring across all sequences in is . Under this assumption, our algorithm enforces an effective partitioning of a series of modified suffix trees to localize processing within each processor. We demonstrate the scalability and performance of our parallel algorithm using genomic datasets ranging in size from 18 million to over 270 million reads, on up to 1024 processor cores.
To solve such problems in practice, seed-and-extend style filtering approaches are often employed. The underlying principle is: if two sequences have a k-mismatch common substring of a given length, then they must have an exact common substring of length at least that length divided by (k + 1), rounded down. Therefore, using a fast hashing technique, all pairs of sequences that have an exact common substring of this shorter length are identified. Then, by exhaustively checking all such candidate pairs, the final output is generated. In the sequential setting, the filtering heuristics can be broadly classified under three categories: suffix filtering [12], [23], spaced seeds filtering [3], and substring filtering [21]. Parallel filtering-based heuristics have mainly focused on specific applications. For example, [11] and [19] proposed suffix-tree-based parallel clustering of EST data. Clearly, a filtering-based algorithm cannot provide any run-time guarantees, and the candidate pairs generated can often be overwhelmingly more numerous than the final output. Recently, [22] published the first and only known sub-quadratic exact sequential algorithm for this problem that includes insertions and deletions along with mismatches. Work on accelerating pairwise edit distance estimation among sequences using sequence alignment can also be applied to this problem [18]. However, for a large number of short sequences (with a low error rate, mostly mismatches), all-pair sequence alignment is impractical because of its quadratic time complexity.
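The filtering principle can be illustrated with a short sketch. It is not the method proposed in this paper: it shows only the generic seed step of a seed-and-extend heuristic, with k as the mismatch threshold. The names seed_length and candidate_pairs are hypothetical, and a hash map from seed strings to sequence identifiers stands in for whatever hashing technique a real tool would use.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using SeqPair = std::pair<std::size_t, std::size_t>;

// Pigeonhole bound: at most k mismatches split a match of length `len`
// into at most k + 1 exact runs, so the longest run has length at least
// floor(len / (k + 1)).
std::size_t seed_length(std::size_t len, std::size_t k) {
    return len / (k + 1);
}

// Returns all pairs (i, j), i < j, of sequences that share at least one
// exact substring of length `seed`. These are only candidates; each pair
// still has to be verified against the k-mismatch definition.
std::set<SeqPair> candidate_pairs(const std::vector<std::string>& seqs,
                                  std::size_t seed) {
    assert(seed > 0);
    // Map every length-`seed` substring to the sequences containing it.
    std::unordered_map<std::string, std::set<std::size_t>> index;
    for (std::size_t i = 0; i < seqs.size(); ++i) {
        for (std::size_t pos = 0; pos + seed <= seqs[i].size(); ++pos) {
            index[seqs[i].substr(pos, seed)].insert(i);
        }
    }
    // Every two sequences listed under the same seed form a candidate pair.
    std::set<SeqPair> pairs;
    for (const auto& entry : index) {
        const std::set<std::size_t>& ids = entry.second;
        for (auto it = ids.begin(); it != ids.end(); ++it) {
            for (auto jt = std::next(it); jt != ids.end(); ++jt) {
                pairs.emplace(*it, *jt);
            }
        }
    }
    return pairs;
}
```

The sketch also makes the weakness of filtering visible: a seed that occurs in many sequences contributes a number of candidate pairs quadratic in its occurrence count, regardless of how many of them survive verification.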
This paper is organized as follows. In Section 2, we introduce notations and data structures used in our algorithm. Due to the absence of a provably efficient sequential algorithm for this problem, we first design such an algorithm and present it in Section 3. In Section 4, we describe the parallel algorithm in detail and prove the claimed bounds on expected time and communication rounds. In Section 5, we discuss the implementation details of the parallel algorithm. Finally, in Section 6, we discuss the results of our implementation on genomic and gene expression datasets.
Notation and preliminaries
Let denote the alphabet of the sequences in . Throughout this paper, both and are assumed to be constants. Let be the concatenation of all sequences in , separated by special characters . Here each is a unique special symbol and is lexicographically larger than all characters in . Clearly, there exists a one to one mapping between the positions in (except the positions) and the positions in the sequences in . Let denote the identifier of
The exact match case
As a warm-up, we first show how to solve the exact match case (k = 0) in optimal worst case time. First create the , then identify all nodes in such that and . Such nodes are termed as marked nodes. Clearly, a pair of suffixes satisfies condition (1) specified in Problem 2 iff their corresponding leaves are under the same marked node. This allows us to process the suffixes under each marked node independently as follows: let
Our parallel algorithm
In this section, we show how to extend our ideas to obtain a provably efficient parallel algorithm. We assume that the input set of strings, or equivalently their concatenated string, is distributed across the processors such that each processor has the same number of total characters. Note that a maximal common substring with k mismatches is essentially a concatenation of maximal exact matches separated by mismatch positions in between. Among the various ways the mismatches can
Implementation details
We implemented our algorithm using C++ and MPI. We use block-wise distribution for the distributed SA and LCP arrays. We refer to the elements located within a processor as its ‘local elements’ or ‘local array’.
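To make the terminology concrete, the following sketch shows one common way to compute a block-wise distribution of an array of n elements over p processors. The paper does not specify how the remainder is handled when p does not divide n, so the balancing rule here (the first n mod p ranks hold one extra element) is an assumption, and the function names are hypothetical. MPI calls are omitted so that the index arithmetic can be run on its own.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Half-open range [begin, end) of global indices stored on `rank`.
// Assumption: the first (n % p) ranks hold one extra element.
std::pair<std::size_t, std::size_t> block_range(std::size_t n, std::size_t p,
                                                std::size_t rank) {
    assert(p > 0 && rank < p);
    const std::size_t base = n / p;
    const std::size_t extra = n % p;
    const std::size_t begin = rank * base + std::min(rank, extra);
    const std::size_t size = base + (rank < extra ? 1 : 0);
    return {begin, begin + size};
}

// Rank that stores global index g under the same distribution.
std::size_t owner_of(std::size_t n, std::size_t p, std::size_t g) {
    assert(p > 0 && g < n);
    const std::size_t base = n / p;
    const std::size_t extra = n % p;
    const std::size_t big = extra * (base + 1);  // elements in larger blocks
    if (g < big) return g / (base + 1);
    return extra + (g - big) / base;
}

// Position of global index g inside its owner's local array.
std::size_t local_index(std::size_t n, std::size_t p, std::size_t g) {
    return g - block_range(n, p, owner_of(n, p, g)).first;
}
```

With this layout, a processor can determine which rank holds any SA or LCP entry from the global index alone, without communication.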
Experimental results
We ran our experiments on an Intel Xeon InfiniBand cluster. Each node has a 2.7 GHz -core Intel Xeon processor with GB of main memory and runs the RHEL 7.6 operating system. The nodes of the cluster are interconnected with EDR ( Gbps) InfiniBand. Experiments were conducted on up to 64 nodes using 16 cores per node, totaling 1,024 cores. We evaluated our algorithm on three different datasets, detailed in Table 1. Dataset D2 (NCBI SRA Accession Number SRR2984882) consists of 18.4
Conclusion
Approximate sequence matching algorithms are of significant interest in computational biology as a replacement for quadratic alignment-based algorithms, particularly as high-throughput sequencers are producing large-scale datasets. In this paper, we presented a parallel algorithm for finding k-mismatch, all-pair maximal common substrings between a large set of strings. While the only sub-quadratic sequential algorithm to solve this problem is the one recently proposed by [22], there is no
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This research is supported in part by the U.S. National Science Foundation under IIS-1416259, CCF-1704552 and CCF-1703489.
Sriram P. Chockalingam is a research scientist in the Institute for Data Engineering and Science at the Georgia Institute of Technology, Atlanta, GA. He received his Ph.D. degree in Computer Science and Engineering from Indian Institute of Technology Bombay, India. His research interests include parallel algorithms, approximate sequence matching and systems biology.
References (24)
- et al. Optimal doubly logarithmic parallel algorithms based on finding all nearest smaller values. J. Algorithms (1993)
- et al. All-pairs computations on many-core graphics processors. Parallel Comput. (2013)
- et al. Gene transcript clustering: a comparison of parallel approaches. Future Gener. Comput. Syst. (2005)
- et al. Approximate all-pairs suffix/prefix overlaps. Inform. and Comput. (2012)
- et al. HPCToolkit: Tools for performance analysis of optimized parallel programs. Concurr. Comput.: Pract. Exper. (2010)
- et al. Better filtering with gapped q-grams. Fund. Inform. (2003)
- et al. A note on the height of suffix trees. SIAM J. Comput. (1992)
- et al. On the sorting-complexity of suffix tree construction. J. ACM (2000)
- et al. A new succinct representation of RMQ-information and improvements in the enhanced suffix array
- et al. Parallel distributed memory construction of suffix and longest common prefix arrays
- Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res.
- Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnol.
Sharma V. Thankachan is an Assistant Professor in the Department of Computer Science at the University of Central Florida, Orlando. He received his Ph.D. degree in Computer Science from Louisiana State University in 2014. Prior to that, he received his B.Tech. degree in Electrical and Electronics Engineering from the National Institute of Technology Calicut, India in 2006. His research interests include parallel algorithms, computational biology and succinct data structures.
Srinivas Aluru is the Executive Director of the Georgia Tech Interdisciplinary Research Institute (IRI) in Data Engineering and Science (IDEaS) and a professor in the School of Computational Science and Engineering within the College of Computing. He co-leads the NSF South Big Data Regional Innovation Hub, which nurtures big data partnerships between organizations in the 16 Southern States and Washington D.C., and the NSF Transdisciplinary Research Institute for Advancing Data Science. Aluru conducts research in high performance computing, data science, bioinformatics and systems biology, combinatorial scientific computing, and applied algorithms. He pioneered the development of parallel methods in computational biology, and contributed to the assembly and analysis of complex plant genomes. His contributions in scientific computing lie in the parallel Fast Multipole Method, domain decomposition methods, spatial data structures, and applications in computational electromagnetics and materials informatics. Aluru serves on the editorial boards of the IEEE Transactions on Big Data, IEEE/ACM Transactions on Computational Biology and Bioinformatics, the Journal of Parallel and Distributed Computing, and the International Journal of Data Mining and Bioinformatics. He is a Fellow of the American Association for the Advancement of Science (AAAS) and the Institute of Electrical and Electronics Engineers (IEEE).