Efficient Performance Prediction for Apache Spark, Journal of Parallel and Distributed Computing


Efficient Performance Prediction for Apache Spark
Journal of Parallel and Distributed Computing (IF 3.4), Pub Date: 2020-11-17, DOI: 10.1016/j.jpdc.2020.10.010
Guoli Cheng, Shi Ying, Bingming Wang, Yuhang Li

Spark is a distributed big data processing framework that followed Hadoop and is more efficient. It provides users with more than 180 adjustable configuration parameters, and automatically choosing the optimal configuration so that a Spark application runs efficiently is challenging. The key to addressing this challenge is the ability to predict the performance of Spark applications under different configurations. This paper proposes a new approach based on Adaboost that efficiently and accurately predicts the performance of a given application under a given Spark configuration. In our approach, Adaboost is used to build a set of stage-level performance models for Spark. To minimize modeling overhead, we use classic projective sampling, a data mining technique that lets us collect as few training samples as possible while still meeting the accuracy requirements. We evaluate the proposed approach on six typical Spark benchmarks with five input datasets. The experimental results show that our approach achieves lower prediction error and lower cost than the previously proposed approach.
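
The sampling idea in the abstract can be illustrated with a small sketch. The abstract does not say which learning-curve family or stopping rule the authors use, so everything below is an assumption made for illustration: the curve is taken to be a power law, err = a * n^b, fitted to a few (sample size, prediction error) measurements, and the sample size needed to reach a target error is then projected from that fit. The function names and the measurements are hypothetical and are not taken from the paper.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(err) = log(a) + b * log(n); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def projected_sample_size(a, b, target_error):
    """Smallest integer n with a * n**b <= target_error (needs b < 0)."""
    if b >= 0:
        raise ValueError("fitted error does not decrease with sample size")
    return math.ceil((target_error / a) ** (1.0 / b))


# Hypothetical measurements (not from the paper): model error observed
# after training on n sampled Spark configurations.
sizes = [10, 20, 40, 80]
errors = [0.40, 0.30, 0.22, 0.17]

a, b = fit_power_law(sizes, errors)
n_needed = projected_sample_size(a, b, target_error=0.10)
print(f"fitted curve: err = {a:.3f} * n^{b:.3f}; projected n = {n_needed}")
```

Under these assumptions, only a handful of small training sets need to be collected before deciding how many configurations to run in total, which is the kind of saving the abstract describes.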


Updated: 2020-12-01
