SMusket: Spark-based DNA error correction on distributed-memory systems
Future Generation Computer Systems (IF 6.2), Pub Date: 2019-10-31, DOI: 10.1016/j.future.2019.10.038
Roberto R. Expósito, Jorge González-Domínguez, Juan Touriño

Next-Generation Sequencing (NGS) technologies have revolutionized genomics research over the last decade, bringing new opportunities for scientists to perform groundbreaking biological studies. Error correction in NGS datasets is considered an important preprocessing step in many workflows, as sequencing errors can severely affect the quality of downstream analysis. Although current error correction approaches provide reasonably high accuracy, their computational cost can still be unacceptable when processing large datasets. In this paper, we propose SparkMusket (SMusket), a Big Data tool built upon the open-source Apache Spark cluster computing framework to boost the performance of Musket, one of the most widely adopted and top-performing multithreaded correctors. Our tool efficiently exploits Spark features to implement a scalable error correction algorithm intended for distributed-memory systems built using commodity hardware. The experimental evaluation on a 16-node cluster using four publicly available datasets shows that SMusket is up to 15.3 times faster than previous state-of-the-art MPI-based tools and provides a maximum speedup of 29.8 over its multithreaded counterpart. SMusket is publicly available under an open-source license at https://github.com/rreye/smusket.




Updated: 2019-10-31