SMusket: Spark-based DNA error correction on distributed-memory systems

https://doi.org/10.1016/j.future.2019.10.038

Highlights

  • Big Data tool for efficient DNA read error correction on distributed-memory systems.

  • Scalable Spark implementation of a k-spectrum algorithm based on Musket.

  • SMusket shows speedups of up to 29.8x over Musket on a 16-node cluster.

  • SMusket is up to 15.3x faster than state-of-the-art MPI-based tools.

Abstract

Next-Generation Sequencing (NGS) technologies have revolutionized genomics research over the last decade, bringing new opportunities for scientists to perform groundbreaking biological studies. Error correction in NGS datasets is considered an important preprocessing step in many workflows, as sequencing errors can severely affect the quality of downstream analysis. Although current error correction approaches provide reasonably high accuracy, their computational cost can still be unacceptable when processing large datasets. In this paper we propose SparkMusket (SMusket), a Big Data tool built upon the open-source Apache Spark cluster computing framework to boost the performance of Musket, one of the most widely adopted and top-performing multithreaded correctors. Our tool efficiently exploits Spark features to implement a scalable error correction algorithm intended for distributed-memory systems built from commodity hardware. The experimental evaluation on a 16-node cluster using four publicly available datasets has shown that SMusket is up to 15.3x faster than previous state-of-the-art MPI-based tools, while also providing a maximum speedup of 29.8x over its multithreaded counterpart. SMusket is publicly available under an open-source license at https://github.com/rreye/smusket.

Introduction

In recent years, the volume of biological data has increased exponentially due to significant advances in the throughput and cost of Next-Generation Sequencing (NGS) platforms [1]. These advances are providing new opportunities for researchers to better understand genetic variation among individuals, helping to characterize complex diseases such as cancer at the genomic level. Nowadays, NGS technologies are able to generate up to terabytes of raw data in a single sequencing run, and this trend is expected to continue in the coming years [2]. Apart from lower cost and increased throughput, NGS technologies also introduce, as a downside, higher error rates in the DNA sequence fragments (so-called reads) compared to traditional Sanger sequencing methods [3]. These errors degrade the quality of downstream analysis and complicate data processing for many biological studies such as de novo genome assembly [4] or short-read mapping [5]. Therefore, an important but computationally intensive and time-consuming preprocessing step in many NGS pipelines is read error correction, which improves not only the quality of downstream analysis but also the accuracy and speed of all the tools used in the pipeline.

The explosion in the amount of available biological data poses heavy computational and storage challenges for current systems. Many data analysis pipelines require significant runtimes to transform raw data into valuable information for clinical diagnosis and discovery. Correcting sequencing errors in massive NGS datasets within a reasonable time can be tackled by relying on parallel computing techniques. However, most existing state-of-the-art error correction tools are limited to shared-memory systems or require specific hardware devices or features. The emergence of Big Data technologies such as the MapReduce paradigm introduced by Google [6] has enabled large applications to be deployed on distributed-memory systems built from commodity off-the-shelf hardware and executed in a highly scalable manner. As a consequence, Big Data and cloud computing have gained much attention in the bioinformatics and biomedical fields when dealing with the challenges posed by abundant biological data [7], [8], [9], [10], [11].

In this paper we present SparkMusket (SMusket), an error correction tool built upon the open-source Apache Spark framework [12] to exploit the parallel capabilities of Big Data technologies on distributed-memory systems. Spark is a popular cluster computing framework that supports efficient in-memory computations by relying on a distributed-memory abstraction known as Resilient Distributed Datasets (RDDs) [13], which provide data parallelism and fault tolerance implicitly. Our tool reimplements, on top of the Spark programming model, the accurate error correction algorithm of Musket [14], a top-performing and widely used multithreaded corrector that applies the k-mer spectrum-based method [15] through three correction techniques arranged in a multistage workflow. SMusket currently supports the processing of both single- and paired-end DNA reads stored in standard unaligned sequence formats (FASTQ/FASTA). The main contributions of this paper are:

  • A thorough literature review and taxonomy on error correction methods and tools for DNA reads.

  • A detailed description of SMusket, a distributed error correction tool that efficiently takes advantage of several Spark features (e.g., RDDs, broadcast variables) to fully exploit the performance of Big Data clusters.

  • An extensive experimental evaluation of SMusket on a 16-node cluster using four publicly available real datasets that demonstrates the performance benefits of the proposed Spark-based algorithm when compared to previous state-of-the-art MPI-based and multithreaded tools.

The remainder of the paper is structured as follows: Section 2 introduces the background of the paper. Section 3 discusses the related work. The design and implementation of our tool are described in Section 4. Section 5 presents the experimental evaluation carried out on a Spark cluster to assess the performance of SMusket, together with a comparison against representative related tools. Finally, Section 6 concludes the paper and proposes future work.

Background

This section introduces the main concepts and technologies necessary to understand our proposal: the k-mer spectrum-based (or k-spectrum-based) correction method (Section 2.1), the MapReduce model (Section 2.2) and the Apache Spark framework (Section 2.3).
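
As a brief illustration of the k-spectrum idea before the formal description, the sketch below is our own minimal Java example, not Musket's actual code; the class and method names and the fixed coverage cutoff are assumptions made for clarity. It builds the k-mer spectrum of a read set and separates trusted from untrusted k-mers:

  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;

  public class KSpectrumSketch {
      // Build the k-mer spectrum: the multiplicity of every k-mer
      // (substring of length k) observed across all input reads
      static Map<String, Integer> countKmers(List<String> reads, int k) {
          Map<String, Integer> counts = new HashMap<>();
          for (String read : reads)
              for (int i = 0; i + k <= read.length(); i++)
                  counts.merge(read.substring(i, i + k), 1, Integer::sum);
          return counts;
      }

      // k-mers occurring at least 'cutoff' times are trusted; rare k-mers
      // most likely stem from sequencing errors and drive the correction
      static Set<String> trustedKmers(Map<String, Integer> counts, int cutoff) {
          Set<String> trusted = new HashSet<>();
          counts.forEach((kmer, n) -> { if (n >= cutoff) trusted.add(kmer); });
          return trusted;
      }
  }

In practice, correctors such as Musket typically derive the cutoff from the k-mer coverage histogram of the dataset rather than fixing it a priori.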

Related work

The exploitation of Big Data technologies such as MapReduce and Spark to accelerate the storage, processing and visualization of large datasets has transformed multiple disciplines through the new knowledge these technologies help to generate. Representative examples in the literature span fields such as weather forecasting [48], healthcare [49], medical imaging simulation [50], social networks [51], industrial IoT [52] and deep learning [53]. In the particular case of the

SMusket implementation

Musket is a command-line tool implemented in C++ and parallelized using threads [14]. Unfortunately, Spark does not support writing the driver program in C++, which would have greatly facilitated our implementation. Among the currently supported languages (Scala, Java, Python and R), we selected Java to implement SMusket in order to ease code conversion from C++, given their comparable object-oriented models and some syntactic similarities. However, it is important to remark that although the
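
To give a flavor of how these pieces can fit together on Spark, the following minimal Java driver is our own simplified sketch rather than SMusket's actual implementation: it assumes one read sequence per input line, an arbitrary coverage cutoff and hypothetical class and method names, and only illustrates the typical RDD and broadcast-variable pattern for a k-spectrum workload.

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.spark.broadcast.Broadcast;
  import scala.Tuple2;

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map;

  public class CorrectionDriverSketch {
      public static void main(String[] args) {
          SparkConf conf = new SparkConf().setAppName("k-spectrum-correction");
          try (JavaSparkContext sc = new JavaSparkContext(conf)) {
              final int k = 21; // illustrative k-mer length

              // 1. Load reads as an RDD (assumes one read sequence per line)
              JavaRDD<String> reads = sc.textFile(args[0]);

              // 2. Build the k-mer spectrum with a map/reduce pattern
              JavaPairRDD<String, Integer> spectrum = reads
                  .flatMap(read -> {
                      List<String> kmers = new ArrayList<>();
                      for (int i = 0; i + k <= read.length(); i++)
                          kmers.add(read.substring(i, i + k));
                      return kmers.iterator();
                  })
                  .mapToPair(kmer -> new Tuple2<>(kmer, 1))
                  .reduceByKey(Integer::sum);

              // 3. Keep trusted k-mers and ship them to every executor
              //    once as a read-only broadcast variable
              Map<String, Integer> trusted = spectrum
                  .filter(t -> t._2() >= 3) // illustrative coverage cutoff
                  .collectAsMap();
              Broadcast<Map<String, Integer>> bTrusted = sc.broadcast(trusted);

              // 4. Correct every read against the broadcast spectrum
              JavaRDD<String> corrected =
                  reads.map(read -> correctRead(read, bTrusted.value(), k));
              corrected.saveAsTextFile(args[1]);
          }
      }

      // Placeholder for the multistage correction logic itself
      static String correctRead(String read, Map<String, Integer> trusted, int k) {
          return read; // a real corrector would edit weak bases here
      }
  }

Broadcasting the spectrum sends it to each executor once instead of shipping it with every task, which is the standard Spark idiom for sharing large read-only lookup structures.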

Performance evaluation

As mentioned in Section 4.1, the quality of error correction provided by SMusket remains the same as that of Musket, whose accuracy has already been thoroughly assessed in multiple previous studies [14], [24], [29], [60]. Therefore, the experimental evaluation of SMusket has focused on performance in terms of execution time, as our main goal is to provide a scalable parallel tool. In order to evaluate performance and scalability, a 16-node commodity cluster running Spark version 2.3.1 has been
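
For reference, Spark applications are normally launched on a cluster through the spark-submit script. A hypothetical invocation sized for a 16-node deployment could look as follows; the resource values and the placeholders in angle brackets are illustrative and do not reflect SMusket's documented command line:

  spark-submit \
    --master yarn \
    --num-executors 16 \
    --executor-cores 8 \
    --executor-memory 16g \
    <smusket-jar> <smusket-arguments>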

Conclusions

The massive amount of data produced by modern NGS technologies reinforces the need for scalable tools able to perform parallel computations by taking advantage of distributed-memory systems. In this paper we have presented SMusket, a Big Data tool that fully exploits the features of Apache Spark to boost the performance of Musket, a popular and accurate DNA read error corrector. Our tool extends the correction capabilities of Musket to distributed-memory systems, obtaining high

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Ministry of Economy, Industry and Competitiveness of Spain and by FEDER funds of the European Union (project TIN2016-75845-P, AEI/FEDER/EU), and by Xunta de Galicia (projects ED431G/01 and ED431C 2017/04).

References

  • Stephens, Z.D., Big data: astronomical or genomical?, PLoS Biol. (2015)

  • Lam, H.Y., Performance comparison of whole-genome sequencing platforms, Nat. Biotechnol. (2012)

  • Alkan, C., et al., Limitations of next-generation genome sequence assembly, Nat. Methods (2011)

  • Li, H., et al., Fast and accurate short read alignment with Burrows-Wheeler transform, Bioinformatics (2009)

  • Dean, J., et al., MapReduce: simplified data processing on large clusters, Commun. ACM (2008)

  • Zou, Q., et al., Survey of MapReduce frame operation in bioinformatics, Brief. Bioinform. (2013)

  • Luo, J., et al., Big data application in biomedical research and health care: a literature review, Biomed. Inform. Insights (2016)

  • Zaharia, M., Apache Spark: a unified engine for big data processing, Commun. ACM (2016)

  • Zaharia, M., et al., Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing, in: ...

  • Liu, Y., et al., Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data, Bioinformatics (2013)

  • Chaisson, M., et al., Fragment assembly with short reads, Bioinformatics (2004)

  • Yang, X., et al., A survey of error-correction methods for next-generation sequencing, Brief. Bioinform. (2012)

  • Molnar, M., et al., Correcting Illumina data, Brief. Bioinform. (2014)

  • Yang, X., et al., Reptile: representative tiling for short read error correction, Bioinformatics (2010)

  • Kelley, D.R., et al., Quake: quality-aware detection and correction of sequencing errors, Genome Biol. (2010)

  • Shi, H., et al., A parallel algorithm for error correction in high-throughput short-read data on CUDA-enabled graphics hardware, J. Comput. Biol. (2010)

  • Simpson, J.T., et al., Efficient de novo assembly of large genomes using compressed data structures, Genome Res. (2012)

  • Ilie, L., et al., RACER: rapid and accurate correction of errors in reads, Bioinformatics (2013)

  • Song, L., et al., Lighter: fast and memory-efficient sequencing error correction without counting, Genome Biol. (2014)

  • Heo, Y., et al., BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads, Bioinformatics (2014)

  • Heo, Y., et al., BLESS 2: accurate, memory-efficient and fast error correction method, Bioinformatics (2016)

  • Li, H., BFC: correcting Illumina sequencing errors, Bioinformatics (2015)

  • Ramachandran, A., Heo, Y., Hwu, W.-M., Ma, J., Chen, D., FPGA accelerated DNA error correction, in: Proceedings of the 2015...

  • Długosz, M., et al., RECKONER: read error corrector based on KMC, Bioinformatics (2017)

  • Xu, K., et al., SPECTR: scalable parallel short read error correction on multi-core and many-core architectures, in: ...

  • Zhao, L., Mining statistically-solid k-mers for accurate NGS error correction, BMC Genom. (2018)

  • Schröder, J., et al., SHREC: a short-read error correction method, Bioinformatics (2009)

Roberto R. Expósito received the B.S. (2010), M.S. (2011) and Ph.D. (2014) degrees in Computer Science from the Universidade da Coruña (UDC), Spain. He is currently an Assistant Professor in the Department of Computer Engineering at UDC. His main research interests are in the area of HPC and Big Data, focused on the performance optimization of distributed processing models in HPC and cloud environments, and the parallelization of bioinformatics and data mining applications. His homepage is http://gac.udc.es/~rober.

Jorge González-Domínguez received the B.S. (2008), M.S. (2009) and Ph.D. (2013) degrees in Computer Science from the Universidade da Coruña (UDC), Spain. He is currently an Assistant Professor in the Department of Computer Engineering at UDC. His main research interest is the development of parallel applications in fields such as bioinformatics, data mining and machine learning, targeting different architectures (multicore systems, GPUs, clusters, etc.). His homepage is http://gac.udc.es/~jorgeg.

Juan Touriño received the B.S. (1993), M.S. (1993) and Ph.D. (1998) degrees in Computer Science from the Universidade da Coruña (UDC), Spain. In 1993, he joined the Department of Computer Engineering at UDC, where he is currently a Full Professor and Head of the Computer Architecture Group. He has extensively published in the area of parallel and distributed computing, currently focusing on the convergence of HPC and Big Data. He is coauthor of more than 160 technical papers in this area and has served on the program committees of 70 international conferences. His homepage is http://gac.udc.es/~juan.
