Research noteA high performance multiple sequence alignment system for pyrosequencing reads from multiple reference genomes
Highlights
► A domain decomposition strategy for multiple alignments of pyrosequencing short reads. ► Domain decomposition & characteristics from pyroreads exploited to enhance parallelism. ► Experiments involve a large number of reads (up to 0.5 million) from multiple genomes. ► Algorithm (using MPI) exhibits superlinear speedups for a large number of reads. ► Rigorous quality assessment comparing multiple alignments and mapping.
Introduction
For over a decade, Sanger sequencing has been the cornerstone of genome sequencing including that of microbial genomes. Improvements in DNA sequencing techniques and advances in data storage and analysis, as well as developments in bioinformatics have reduced the cost to a mere $8000–$10 000 per megabase of high quality genome draft sequence. However, the need of more efficient and cost effective approaches has led to the development of new sequencing technologies such as the 454 GS20 sequencing platform. It is a non-cloning pyrosequencing based platform that is capable of generating 1 million overlapping reads in a single run. However, multitude of factors, such as relatively short read lengths as compared to Sanger (as of year 2010: an average of 100–400 nt length reads compared to 800–1000 nt length reads for Sanger sequencing), lack of a paired end protocol, and limited accuracy of individual reads for repetitive DNA, particularly in the case of monopolymer repeats, present many computational challenges [17] to make pyrosequencing useful for biology and bioinformatics applications.
One of the most widely employed preprocessing step for many applications, including haplotype reconstruction [10], [51], analysis of microbial community analysis [29], analysis of genes for diseases [18], is the alignment of these reads with the wild type. For important applications such as viral population estimation or haplotype reconstruction of various viruses e.g., HIV in a population, scientists usually have the information about the wild type genome of the virus. To this date, numerous tools have been suggested for mapping short reads to the single reference genome. These methods include strategies for indexing substrings of short reads or references with the use of -mer counts or space seeds. The list of academic tools available for such mapping include Bowtie, BWA, CloudBurst, MAQ, MOM, MosaikAligner, mrFAST, mrsFAST, Pash, PASS, PatMaN, RazorS, RMAP, SeqMap, SHRiMP, SliderII, SOAP, SOAP2, ssaha2 [2], [4], [6], [15], [19], [23], [25], [28], [26], [30], [31], [35], [38], [39], [43], [47], [52], [1], and commercial tools such as ZOOM [27]. The current demand in alignment can be met with new indexing strategies and efficient data processing [24]. However, these strategies are usually at the cost of simplifying the mapping problem and not allowing complex alignments, including gaps or alignment with multiple reference genomes.
Natural strains of genomes that have a high level of individual difference, such as natural inbred strains of Arabidopsis or divergent strains of HIV, pose a considerable alignment challenge [45]. It has been shown in [3], [53] that several percent of reference genome can be missing or very divergent in distinct strains of different species. This poses a problem for pairwise mapping algorithms because the mappings can result in regions that are inaccessible when aligned to the wildtype of the species. This is particularly true for algorithms that do not take into account mismatches and gaps. Fig. 1 shows an example set of reads from different strains of genomes from the same species.
Apart from the above mentioned difficulties in using pairwise mapping algorithms for short reads, aligning against only a single reference has its own shortcomings. Aligning against a single reference biases the analysis toward a comparison within the sequence space highly conserved with the reference. Taking into account all the strains of the genome reduces this bias. Aligning each read individually with each reference genome loses important information such as variation SNP’s in different strains, besides increasing computation time and memory. Therefore, to interpret the alignment results some kind of merging and interpreting strategy is required.
One such strategy was recently introduced by Schneeberger et al. in [45], called GenomeMapper. It performs simultaneous alignment of short reads against multiple genomes using a hash-based data structure [45], [37]. However, the accuracy of such alignments would be limited by the graph structure that captures the information from different strains of the same species. Also the method has the drawback that the reads are pairwise aligned to the graph structure and the comparison of reads with each other is not achieved. In this paper, we present a solution to the problem of aligning pyroreads from multiple genomes using a multiple alignment methodology on multiprocessor platforms.
In theory, alignment of multiple sequences can be achieved using pair-wise alignment, each pair getting alignment score and then maximizing the sum of all the pair-wise alignment scores. But for optimal alignment the sum of all the pair-wise alignment scores needs to be maximized, which is an NP complete problem [50]. Toward this end, dynamic programming based solutions of complexity have been pursued, where is the number of sequences and is the average length of a sequence. Such accurate optimizations are not practical for a large number of sequences as is the case in pyrosequencing, thus making heuristic algorithms as the only feasible option. The literature on these heuristics is vast and includes widely used works, such as Notredame et al. [36], Edgar [7], Thompson et al. [48], Do et al. [5], and Morgenstern [32].
Recently we introduced a multiple alignment system for pyrosequencing reads, known as pyro-align [42]. Although, the sequential algorithm is highly efficient compared to the other alignment algorithms, it is still not feasible to multiple align a large number of reads from multiple genomes on a single processor. It takes over 7 days to multiple align 200,000 pyroreads on single processor using this best known sequential algorithm [42], and the time increases drastically with the increase in number of reads.
In this paper, we present a parallel multiple alignment solution for aligning reads from multiple genomes. The objective of our work is to develop a high performance multiple alignment system for small error prone reads from multiple genomes strands of the same species, such that the errors in the alignment are ‘highlighted’ and the system is able to handle a large number of reads, as may be expected from pyrosequencing procedure. We emphasize that this problem is considerably harder and different compared to mapping of the reads to a single reference genome.
The multiple sequence alignment algorithm presented, distinct from pairwise mapping or genome reconstruction, is specifically designed for the alignment of reads from multiple genomes from different strains of the same species. The proposed algorithm is a combination of the domain decomposition concept introduced in [40], [41] and the exploitation of pyroreads characteristics for parallelization, therefore it is capable of aligning a very large number of pyroreads. It takes into account the position of the reads with respect to the reference genomes, and assigns weight to the leading and trailing gaps for the reads. After the decomposition, the reads are distributed among multiple processors using a metric calculated from the reads. The reads are multiple aligned on each processor with respect to each other and the genome. Then, the aligned reads on the processors are merged by exploiting the characteristics specific to pyro-reads. We assume that the reads may be generated from one or many genomes, with ‘forward’ orientation. We also assume that a wildtype of the species is available, as is generally the case for resequencing experiments. In our experiments, we have used HIV-pol gene virus as the reference genome (with length of 1971 bp) and simulator ReadSim [44] to generate these reads. Since the complexity for multiple alignments is dominated by the number of reads, the length of the genome has minimal effect in terms of memory and timings. Please note that the above genome is only used as a proof of concept for the technique. The importance of aligning reads from multiple genomes is discussed in the literature [37], [45] and is not stated here in the interest of brevity. We are able to align 200,000 reads on a 24 processor system in just 278 min. The alignment of same number of reads takes approximately 7 days on a single node system.
For the sake of completeness, let us first formally define the multiple sequence alignment problem in its generic form, without indulging with the issues such as scoring functions. Let sequences be presented as a set and let be the aligned sequence set, such that all the sequences in are of equal length, have maximum overlap, and the score of the global map is maximum according to some scoring mechanism suitable for the application. A perfect multiple alignment for pyroreads would be, that the reads are aligned with each other such that the position of the reads with respect to the reference genome(s) is conserved; the reads have maximum overlap (for conserved genome regions) and are of equal lengths after the alignment, including leading and trailing gaps.
Section snippets
Proposed parallel multiple sequence alignment algorithm for pyroreads: P-Pyro-Align
The intuitive idea behind the proposed P-Pyro-Align algorithm is to first place the reads in correct orientation with respect to the wildtype and then use progressive alignment to achieve the final alignment. The starting position of the reads w.r.t. the genome is referred to as -rank and is used to redistribute the reads on multiple processors. Regular sampling [14] based on -rank is used to achieve load-balancing among multiple processors. The final alignment is produced by combining the
Analysis of computation and communication costs
For the computation and communication analysis we use a coarse grained computing model such as -model [13], [12]. Also, for analysis purposes, we assume that the Clustalw System [48] is being used at each processor as the underlying profile–profile alignment system. It must be noted that the computation complexity of the alignment step will vary depending on the sequential MSA system used for alignment within each processor.
In the following analysis we assume that each processor has
Performance analysis
The performance evaluation process has been divided into two parts: the first part deals with the quality assessment, and the second part deals with traditional HPC metrics such as execution time, scalability, memory requirements, etc. The performance evaluation of the P-Pyro-Align algorithm is carried out on a Beowulf Cluster consisting of 24 Intel Xeon processors, each running at 3.0 GHz, with 512 KB cache and 2 GB DRAM memory. As for the interconnection network, the system uses Intel Gigabit
Conclusion
With rapidly increasing knowledge of variants, alignment of reads from multiple genomes would be of increasing importance. We have demonstrated that multiple alignment for a large number of error-prone reads is not only possible but it can also produce meaningful results. It gives access to regions that are highly divergent from the first reference or wildtype and would be rendered useless with mapping/pairwise alignments with the reference genome.
To this end, we have presented a highly
Acknowledgments
This work was completed when Fahad Saeed was a PhD candidate at the Department of Electrical and Computer Engineering, University of Illinois at Chicago, IL USA.
Fahad Saeed is a Research Fellow in the Epithelial Systems Biology Laboratory (ESBL) at National Heart Lung and Blood Institute (NHLBI), National Institutes of Health (NIH) Bethesda, Maryland. He was a postdoctoral fellow in the same lab from Aug 2010 to March 2011. He received his Ph.D. in the Department of Electrical and Computer Engineering, University of Illinois at Chicago in 2010. He has served as visiting scientist in the Department of Bio-Systems Science and Engineering (D-BSSE), ETH
References (53)
- et al.
Communication operations on coarse-grained mesh architectures
Parallel Comput.
(1995) - et al.
: a parallel model for coarse-grained machines
J. Parallel Distrib. Comput.
(1996) - et al.
A general method applicable to the search for similarities in the amino acid sequence of two proteins
J. Mol. Biol.
(1970) - et al.
T-coffee: a novel method for multiple sequence alignments
J. Mol. Biol.
(2000) - et al.
A domain decomposition strategy for alignment of multiple biological sequences on multiprocessor platforms
J. Parallel Distrib. Comput.
(2009) - et al.
Personalized copy number and segmental duplication maps using next-generation sequencing
Nat. Genet.
(2009) - et al.
Pass: a program to align short sequences
Bioinformatics
(2009) - et al.
Common sequence polymorphisms shaping genetic diversity in arabidopsis thaliana
Science
(2007) - C. Coarfa, A. Milosavljevic, Pash 2.0: scaleable sequence anchoring for next-generation sequencing technologies, Pac....
- et al.
Probcons: probabilistic consistency-based multiple sequence alignment
Genome Res.
(2005)
Mom: maximum oligonucleotide mapping
Bioinformatics
Muscle: multiple sequence alignment with high accuracy and high throughput
Nucleic Acids Res.
Muscle: a multiple sequence alignment method with reduced time and space complexity
BMC Bioinformatics
A comparison of scoring functions for protein sequence profile alignment
Bioinformatics
Viral population estimation using pyrosequencing
PLoS Comput. Biol.
Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology
Parallel sorting by regular sampling
J. Parallel Distrib. Comput.
Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes
Genome Res.
DNA sequencing: bench to bedside and beyond
Nucleic Acids Res.
Pyrosequencing analysis of the gyrb gene to differentiate bacteria responsible for diarrheal diseases
Eur. J. Clin. Microbiol. Infect. Dis.
Seqmap: mapping massive amount of oligonucleotides to the genome
Bioinformatics
The rapid generation of mutation data matrices from protein sequences
BMC Bioinformatics
Mafft: a novel method for rapid multiple sequence alignment based on fast Fourier transform
Nucleic Acids Res.
Introduction to Parallel Computing: Design and Analysis of Parallel Algorithms
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
Genome Biol.
Cited by (10)
Parallelization of Global Sequence Alignment on Graphics Processing Unit
2020, Proceedings of the 2020 IEEE International Conference on Communications, Computing, Cybersecurity, and Informatics, CCCI 2020A work distribution strategy for global sequence alignment
2019, International Journal of ComputingMaSkal: Privacy preserving masked reads alignment using intel SGX
2018, Proceedings of the IEEE Symposium on Reliable Distributed SystemsBig data proteogenomics and high performance computing: Challenges and opportunities
2016, 2015 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2015
Fahad Saeed is a Research Fellow in the Epithelial Systems Biology Laboratory (ESBL) at National Heart Lung and Blood Institute (NHLBI), National Institutes of Health (NIH) Bethesda, Maryland. He was a postdoctoral fellow in the same lab from Aug 2010 to March 2011. He received his Ph.D. in the Department of Electrical and Computer Engineering, University of Illinois at Chicago in 2010. He has served as visiting scientist in the Department of Bio-Systems Science and Engineering (D-BSSE), ETH Zürich and Swiss Institute of Bioinformatics (SIB) in 2008 and 2009 respectively. Dr. Saeed is the recipient of numerous national and international awards and honors. He is serving as the Program co-chair of the 4th Bioinformatics and Computational Biology conference (BICoB), 2012. He has been nominated for the best dissertation award of the year at the University of Illinois in the year 2010. He was the recipient of NIH postdoctoral fellowship for the year 2010–2011 and recipient of ThinkSwiss fellowship for the years 2007 and 2008 by the government of Switzerland. His research interests include parallel and distributed algorithms and architectures, proteomics and genomics and parallel algorithms for bioinformatics applications. He is a member of IEEE and ACM.
Alan Perez-Rathke received his B.S. and M.S. degrees in computer science from the University of Illinois at Urbana–Champaign and Chicago campuses respectively. His thesis work has largely focused on applications of high performance computing to the field of bioinformatics. Prior to becoming a graduate student, Alan worked for several years as a software engineer at a large video game company. He now wants to leverage his technical skills to help promote research in genomics, proteomics, and medicine—particularly in the area of cancer. Currently, he hopes to fulfill his goal of combining computer science and medicine through acceptance into a combined MD/Ph.D. program.
Jaroslaw Gwarnicki received his B.S. and M.S. degrees in Computer Science from the University of Illinois at Chicago. Ever since being awarded Bachelor Degree he has been employed at a major gaming industry company where as a Senior Software Engineer he specializes in the area of optimizations and parallel computing. His plans for future involve applying the skills and ideas obtained through the presented research to improve and advance the game development industry.
Tanya Berger-Wolf is an Associate Professor in the Department of Computer Science at the University of Illinois at Chicago where she heads the Computational Population Biology Lab. Her research interests are in applications of computational techniques to problems in population biology of plants, animals, and humans, from genetics to social interactions. Dr. Berger-Wolf has received her Ph.D. in Computer Science from University of Illinois at Urbana–Champaign in 2002. She has spent two years as a postdoctoral fellow at the University of New Mexico working in computational phylogenetics and a year at the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS) doing research in computational epidemiology. She has received numerous awards for her research and mentoring, including the US National Science Foundation CAREER Award in 2008 and the UIC Mentor of the Year Award in 2009.
Ashfaq Khokhar received his Ph.D. in Computer Engineering from the University of Southern California, in 1993. After his Ph.D., he spent two years as a Visiting Assistant Professor in the Department of Computer Sciences and School of Electrical and Computer Engineering at Purdue University. In 1995, he joined the Department of Electrical and Computer Engineering at the University of Delaware, where he first served as Assistant Professor and then as Associate Professor. In Fall 2000, Dr. Khokhar joined UIC in the Department of Computer Science and Department of Electrical and Computer Engineering, where he currently serves on the rank of Professor. Dr. Khokhar has published over 200 technical papers and book chapters in refereed conferences and journals in the areas of wireless networks, multimedia systems, data mining, and high performance computing. He is a recipient of the NSF CAREER award in 1998. His paper entitled “Scalable S-to-P Broadcasting in Message Passing MPPs” has won the Outstanding Paper award in the International Conference on Parallel Processing in 1996. He has served as the Program Chair of the 17th Parallel and Distributed Computing Conference (PDCS), 2004, Vice Program Chair for the 33rd International Conference on Parallel Processing (ICPP), 2004, and Program Chair of the IEEE Globecom Symposium on Ad hoc and Sensor Networks, 2009. He is a Fellow of IEEE for his contributions to multimedia computing and databases. His research interests include: wireless and sensor networks, multimedia systems, data mining, and high performance computing.