MaRe: Processing Big Data with application containers on Apache Spark (GigaScience)


GigaScience (IF 11.8), Pub Date: 2020-05-01, DOI: 10.1093/gigascience/giaa042
Marco Capuccini (1, 2), Martin Dahlö (2, 3, 4), Salman Toor (1), Ola Spjuth (2)


BACKGROUND: Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has been proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing processing tools in bioinformatics pipelines. Furthermore, these frameworks do not have native support for application containers, which are becoming popular in scientific data processing.

RESULTS: Here we present MaRe, an open source programming library that introduces support for Docker containers in Apache Spark. Apache Spark and Docker are the MapReduce framework and container engine that have collected the largest open source community; thus, MaRe provides interoperability with the cutting-edge software ecosystem. We demonstrate MaRe on 2 data-intensive applications in life science, showing ease of use and scalability.

CONCLUSIONS: MaRe enables scalable data-intensive processing in life science with Apache Spark and application containers. When compared with current best practices, which involve the use of workflow systems, MaRe has the advantage of providing data locality, ingestion from heterogeneous storage systems, and interactive processing. MaRe is generally applicable and available as open source software.
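The general idea of running existing command-line tools as the map and reduce steps can be illustrated with a minimal Python sketch. This is not MaRe's API and does not use Spark: every name below (`run_tool`, `docker_argv`, `map_reduce`, `COUNT_GC`, `SUM_LINES`) is hypothetical, the partitions are plain in-memory lists, and the two stand-in tools are small Python one-liners so that the sketch runs without a Docker engine. The `docker_argv` helper only shows how the same command could, under the assumption of a local Docker installation, be wrapped in `docker run`.

```python
import subprocess
import sys
from functools import reduce


def run_tool(argv, records):
    """Feed one partition to an external command on stdin; return its stdout lines."""
    result = subprocess.run(
        argv,
        input="\n".join(records) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the tool exits with a non-zero status
    )
    return result.stdout.splitlines()


def docker_argv(image, shell_command):
    """Build a `docker run` command line (assumes a local Docker engine)."""
    return ["docker", "run", "--rm", "-i", image, "sh", "-c", shell_command]


def map_reduce(partitions, map_argv, reduce_argv):
    """Apply the map tool to each partition, then merge the outputs pairwise."""
    mapped = [run_tool(map_argv, part) for part in partitions]
    return reduce(lambda left, right: run_tool(reduce_argv, left + right), mapped)


# Stand-in tools (hypothetical): count G/C characters, then sum the counts.
COUNT_GC = [
    sys.executable, "-c",
    "import sys; print(sum(l.count('g') + l.count('c') for l in sys.stdin))",
]
SUM_LINES = [
    sys.executable, "-c",
    "import sys; print(sum(int(l) for l in sys.stdin))",
]

if __name__ == "__main__":
    partitions = [["gattaca", "ccgg"], ["atat", "gc"]]
    print(map_reduce(partitions, COUNT_GC, SUM_LINES))
```

In this sketch each partition is processed independently, which is the property a framework such as Spark would exploit to run the tool close to the data; replacing `COUNT_GC` with the list returned by `docker_argv(...)` would run the same step inside a container instead.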


Updated: 2020-05-05
