SMusket: Spark-based DNA error correction on distributed-memory systems
Introduction
In recent years, the volume of biological data has increased exponentially due to significant advances in the throughput and cost of Next-Generation Sequencing (NGS) platforms [1]. These advances provide researchers with new opportunities to better understand genetic variation among individuals, helping to characterize complex diseases such as cancer at the genomic level. Nowadays, NGS technologies can generate up to terabytes of raw data in a single sequencing run, and this trend is expected to continue in the coming years [2]. As a downside to their lower cost and higher throughput, NGS technologies also introduce higher error rates in the sequenced DNA fragments (so-called reads) than traditional Sanger sequencing methods [3], which degrades the quality of downstream analyses and complicates data processing in many biological studies such as de novo genome assembly [4] or short-read mapping [5]. Therefore, an important but computationally intensive and time-consuming preprocessing step in many NGS pipelines is read error correction, which improves not only the quality of downstream analyses but also the accuracy and speed of the tools used in the pipeline.
The explosion in the amount of available biological data poses heavy computational and storage challenges for current systems. Many data analysis pipelines require significant runtimes to transform raw data into valuable information for clinical diagnosis and discovery. Correcting sequencing errors in massive NGS datasets in reasonable time can be tackled by relying on parallel computing techniques. However, most existing state-of-the-art error correction tools are limited to shared-memory systems or require specific hardware devices or features. The emergence of Big Data technologies such as the MapReduce paradigm introduced by Google [6] has enabled the deployment of large applications on distributed-memory systems built from commodity off-the-shelf hardware, which can be executed in a highly scalable manner. As a consequence, Big Data and cloud computing have gained much attention in the bioinformatics and biomedical fields when dealing with the challenges posed by abundant biological data [7], [8], [9], [10], [11].
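As a brief illustration of the paradigm mentioned above: MapReduce expresses a computation as a map phase that emits key-value pairs and a reduce phase that aggregates all values sharing the same key. The classic word-count example can be sketched locally in Java, with plain streams standing in for a distributed runtime (the class and method names below are ours, purely illustrative):

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {
    // Map phase: split each line into words (conceptually, (word, 1) pairs);
    // reduce phase: sum the counts of all occurrences of the same word.
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(List.of("to be or not", "to be"));
        System.out.println(counts.get("to")); // 2: "to" appears once in each line
    }
}
```

In a real MapReduce or Spark deployment, the map and reduce phases run on different cluster nodes, with an intermediate shuffle routing all pairs with the same key to the same reducer.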
In this paper we present SparkMusket (SMusket), an error correction tool built upon the open-source Apache Spark framework [12] to exploit the parallel capabilities of Big Data technologies on distributed-memory systems. Spark is a popular cluster computing framework that supports efficient in-memory computations by relying on a distributed-memory abstraction known as Resilient Distributed Datasets (RDDs) [13], which provide data parallelism and fault tolerance implicitly. Our tool reimplements, on top of the Spark programming model, the accurate error correction algorithm of Musket [14], a top-performing and widely used multithreaded corrector based on the k-mer spectrum-based method [15] that provides three correction techniques in a multistage workflow. SMusket currently supports the processing of both single- and paired-end DNA reads stored in standard unaligned sequence formats (FASTQ/FASTA). The main contributions of this paper are:
- A thorough literature review and taxonomy on error correction methods and tools for DNA reads.
- A detailed description of SMusket, a distributed error correction tool that efficiently takes advantage of several Spark features (e.g., RDDs, broadcast variables) to fully exploit the performance of Big Data clusters.
- An extensive experimental evaluation of SMusket on a 16-node cluster using four publicly available real datasets that demonstrates the performance benefits of the proposed Spark-based algorithm when compared to previous state-of-the-art MPI- and multithreaded-based tools.
The remainder of the paper is structured as follows: Section 2 introduces the background of the paper. Section 3 discusses the related work. The design and implementation of our tool is described in Section 4. Section 5 presents the experimental results carried out on a Spark cluster to assess the performance of SMusket together with a comparison with representative related tools. Finally, Section 6 concludes the paper and proposes future work.
Section snippets
Background
This section introduces the main concepts and technologies necessary to understand our proposal: the k-mer spectrum-based (or k-spectrum-based) correction method (Section 2.1), the MapReduce model (Section 2.2) and the Apache Spark framework (Section 2.3).
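The core idea of the k-spectrum method can be sketched as follows (illustrative Java, not SMusket's actual code): all k-mers occurring in the input reads are counted, those whose multiplicity reaches a coverage cutoff are marked as solid (trusted), and bases covered only by weak k-mers become correction candidates:

```java
import java.util.*;
import java.util.stream.*;

public class KmerSpectrum {
    // Count every k-mer (substring of length k) across the input reads.
    static Map<String, Integer> countKmers(List<String> reads, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String read : reads)
            for (int i = 0; i + k <= read.length(); i++)
                counts.merge(read.substring(i, i + k), 1, Integer::sum);
        return counts;
    }

    // A k-mer is "solid" if its multiplicity reaches the coverage cutoff;
    // all other k-mers are "weak" and likely contain sequencing errors.
    static Set<String> solidKmers(Map<String, Integer> counts, int cutoff) {
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= cutoff)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // Three overlapping error-free reads plus one read with a sequencing error.
        List<String> reads = List.of("ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGAACGT");
        Set<String> solid = solidKmers(countKmers(reads, 4), 2);
        System.out.println(solid.contains("ACGT")); // true: seen in every read
        System.out.println(solid.contains("ACGA")); // false: only in the erroneous read
    }
}
```

A real corrector such as Musket additionally chooses the cutoff from the k-mer coverage histogram and attempts base substitutions that turn all weak k-mers covering a position into solid ones.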
Related work
The exploitation of Big Data technologies such as MapReduce and Spark to accelerate the storage, processing and visualization of large datasets has transformed multiple disciplines through the new knowledge these technologies help to generate. Representative examples in the literature include several fields such as weather forecasting [48], healthcare [49], medical imaging simulation [50], social networks [51], industrial IoT [52] and deep learning [53]. In the particular case of the
SMusket implementation
Musket is a command-line tool implemented in C++ and parallelized using threads [14]. Unfortunately, Spark does not support C++ for writing the Driver program, which would have greatly facilitated our implementation. Among the currently supported languages (Scala, Java, Python and R), we selected Java to implement SMusket in order to ease code conversion from C++, due to their comparable object-oriented models and some syntax similarities. However, it is important to remark that although the
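One representative concern when porting C++ k-mer code of this kind to Java can be sketched (an illustrative example of the general technique, not SMusket's actual implementation): C++ correctors commonly pack each nucleotide into 2 bits of a machine word, and the same encoding in a Java primitive long keeps any k-mer with k ≤ 32 in a single value without object overhead:

```java
public class KmerEncoder {
    // Map each nucleotide to 2 bits: A=00, C=01, G=10, T=11.
    static long baseCode(char b) {
        switch (b) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("invalid base: " + b);
        }
    }

    // Pack a k-mer (k <= 32) into a single long, two bits per base.
    static long encode(String kmer) {
        long code = 0;
        for (int i = 0; i < kmer.length(); i++)
            code = (code << 2) | baseCode(kmer.charAt(i));
        return code;
    }

    public static void main(String[] args) {
        System.out.println(encode("ACGT")); // 0b00_01_10_11 = 27
    }
}
```

Compact encodings like this matter in a JVM setting because boxed objects and String k-mers would otherwise dominate memory consumption during k-mer counting.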
Performance evaluation
As mentioned in Section 4.1, the quality of error correction provided by SMusket remains the same as that of Musket, whose accuracy has already been thoroughly assessed in multiple previous studies [14], [24], [29], [60]. Therefore, the experimental evaluation of SMusket has focused on performance in terms of execution time, as our main goal is to provide a scalable parallel tool. In order to evaluate performance and scalability, a 16-node commodity cluster running Spark version 2.3.1 has been
Conclusions
The massive amount of data produced by modern NGS technologies reinforces the need for scalable tools with the ability to perform parallel computations by taking advantage of distributed-memory systems. In this paper we have presented SMusket, a Big Data tool that fully exploits the features of Apache Spark to boost the performance of Musket, a popular and accurate DNA read error corrector. Our tool extends the correction capabilities of Musket to distributed-memory systems obtaining high
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the Ministry of Economy, Industry and Competitiveness of Spain and by FEDER funds of the European Union (project TIN2016-75845-P, AEI/FEDER/EU), and by Xunta de Galicia (projects ED431G/01 and ED431C 2017/04).
References (75)
- Assessing the value of next-generation sequencing technologies: an introduction. Value Health (2018).
- Scalable and efficient whole-exome data processing using workflows on the cloud. Future Gener. Comput. Syst. (2016).
- ‘Big data’, Hadoop and cloud computing in genomics. J. Biomed. Inform. (2013).
- A cost-effective approach to improving performance of big genomic data analyses in clouds. Future Gener. Comput. Syst. (2017).
- DecGPU: distributed error correction on massively parallel graphics processing units using CUDA and MPI. BMC Bioinformatics (2011).
- Towards data analysis for weather cloud computing. Knowl. Based Syst. (2017).
- Big data analytics: understanding its capabilities and potential benefits for healthcare organizations. Technol. Forecast. Soc. Change (2018).
- Social networking big data: opportunities, solutions, and challenges. Future Gener. Comput. Syst. (2018).
- The role of big data analytics in industrial internet of things. Future Gener. Comput. Syst. (2019).
- BDEv 3.0: energy efficiency and microarchitectural characterization of big data processing frameworks. Future Gener. Comput. Syst. (2018).
- Big data: astronomical or genomical? PLoS Biol.
- Performance comparison of whole-genome sequencing platforms. Nat. Biotechnol.
- Limitations of next-generation genome sequence assembly. Nature Methods.
- Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics.
- MapReduce: simplified data processing on large clusters. Commun. ACM.
- Survey of MapReduce frame operation in bioinformatics. Brief Bioinform.
- Big data application in biomedical research and health care: a literature review. Biomed. Inform. Insights.
- Apache Spark: a unified engine for big data processing. Commun. ACM.
- Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics.
- Fragment assembly with short reads. Bioinformatics.
- A survey of error-correction methods for next-generation sequencing. Brief Bioinform.
- Correcting Illumina data. Brief Bioinform.
- Reptile: representative tiling for short read error correction. Bioinformatics.
- Quake: quality-aware detection and correction of sequencing errors. Genome Biol.
- A parallel algorithm for error correction in high-throughput short-read data on CUDA-enabled graphics hardware. J. Comput. Biol.
- Efficient de novo assembly of large genomes using compressed data structures. Genome Res.
- RACER: rapid and accurate correction of errors in reads. Bioinformatics.
- Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biol.
- BLESS: bloom filter-based error correction solution for high-throughput sequencing reads. Bioinformatics.
- BLESS 2: accurate, memory-efficient and fast error correction method. Bioinformatics.
- BFC: correcting Illumina sequencing errors. Bioinformatics.
- RECKONER: read error corrector based on KMC. Bioinformatics.
- Mining statistically-solid k-mers for accurate NGS error correction. BMC Genom.
- SHREC: a short-read error correction method. Bioinformatics.
Roberto R. Expósito received the B.S. (2010), M.S. (2011) and Ph.D. (2014) degrees in Computer Science from the Universidade da Coruña (UDC), Spain. He is currently an Assistant Professor in the Department of Computer Engineering at UDC. His main research interests are in the area of HPC and Big Data, focused on the performance optimization of distributed processing models in HPC and cloud environments, and the parallelization of bioinformatics and data mining applications. His homepage is http://gac.udc.es/~rober.
Jorge González-Domínguez received the B.S. (2008), M.S. (2009) and Ph.D. (2013) degrees in Computer Science from the Universidade da Coruña (UDC), Spain. He is currently an Assistant Professor in the Department of Computer Engineering at UDC. His main research interest is the development of parallel applications in fields such as bioinformatics, data mining and machine learning, focused on different architectures (multicore systems, GPUs, clusters, etc.). His homepage is http://gac.udc.es/~jorgeg.
Juan Touriño received the B.S. (1993), M.S. (1993) and Ph.D. (1998) degrees in Computer Science from the Universidade da Coruña (UDC), Spain. In 1993, he joined the Department of Computer Engineering at UDC, where he is currently a Full Professor and Head of the Computer Architecture Group. He has extensively published in the area of parallel and distributed computing, currently focusing on the convergence of HPC and Big Data. He is coauthor of more than 160 technical papers in this area and has served on the program committees of 70 international conferences. His homepage is http://gac.udc.es/~juan.