

A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench
Journal of Big Data (IF 8.1), Pub Date: 2020-12-14, DOI: 10.1186/s40537-020-00388-5
N. Ahmed, Andre L. C. Barczak, Teo Susnjak, Mohammed A. Rashid

Big Data analytics for storing, processing, and analyzing large-scale datasets has become an essential tool for industry. Distributed computing frameworks such as Hadoop and Spark offer efficient solutions for analyzing vast amounts of data. Owing to its application programming interface (API) availability and its performance, Spark has become very popular, even more so than the MapReduce framework. Both frameworks have more than 150 parameters, and the combination of these parameters has a massive impact on cluster performance. The default parameters let system administrators deploy applications without much effort and measure cluster performance with factory-set values. However, an open question remains: can a different parameter selection improve cluster performance for large datasets? This study investigates the most impactful parameters, grouped under resource utilization, input splits, and shuffle, to compare the performance of Hadoop and Spark on a cluster implemented in our laboratory. We tuned these parameters by trial and error over a large number of experiments. For the comparative analysis, we selected two workloads: WordCount and TeraSort. Performance is evaluated on three criteria: execution time, throughput, and speedup. Our experimental results show that the performance of both systems depends heavily on input data size and correct parameter selection. The analysis shows that Spark outperforms Hadoop when data sets are small, achieving up to a two-fold speedup on WordCount workloads and up to a 14-fold speedup on TeraSort workloads when the default parameter values are reconfigured.
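
The abstract names execution time, throughput, and speedup as the evaluation criteria but gives no formulas. As a minimal sketch, assuming throughput means input size divided by execution time and speedup means the ratio of a baseline run's time to a comparison run's time, the last two can be derived from the first. The function names and the MB/s unit below are illustrative choices, not taken from the paper.

```python
def throughput_mb_per_s(input_bytes: int, execution_time_s: float) -> float:
    """Data processed per second in MB/s (assumes 1 MB = 10**6 bytes)."""
    if execution_time_s <= 0:
        raise ValueError("execution time must be positive")
    return input_bytes / 1e6 / execution_time_s


def speedup(baseline_time_s: float, candidate_time_s: float) -> float:
    """How many times faster the candidate run is than the baseline run.

    Assumed definition: baseline time divided by candidate time, so a
    value above 1 means the candidate finished sooner.
    """
    if baseline_time_s <= 0 or candidate_time_s <= 0:
        raise ValueError("execution times must be positive")
    return baseline_time_s / candidate_time_s
```

Under these assumptions, comparing a Hadoop run with a Spark run on the same input would pass the Hadoop time as the baseline and the Spark time as the candidate.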


Updated: 2020-12-14
