A gray-box modeling methodology for runtime prediction of Apache Spark jobs
Distributed and Parallel Databases (IF 1.5), Pub Date: 2020-03-10, DOI: 10.1007/s10619-020-07286-y
Hani Al-Sayeh, Stefan Hagedorn, Kai-Uwe Sattler

Apache Spark jobs are often characterized by processing huge data sets and therefore require runtimes in the range of minutes to hours. Being able to predict the runtime of such jobs is useful not only to know when a job will finish, but also for scheduling purposes, for estimating the monetary cost of a cloud deployment, or for determining an appropriate cluster configuration, such as the number of nodes. However, predicting Spark job runtimes is much more challenging than for standard database queries: cluster configuration and parameters have a significant performance impact, and jobs usually contain a lot of user-defined code, which makes it difficult to estimate cardinalities and execution costs. In this paper, we present a gray-box modeling methodology for runtime prediction of Apache Spark jobs. Our approach comprises two steps: first, a white-box model that predicts the cardinalities of the input RDDs of each operator is built from prior knowledge about the operator's behavior and from application parameters such as applied filters, number of iterations, etc. Second, a black-box model for each task is constructed by monitoring runtime metrics while varying the allocated resources and the input RDD cardinalities. We further show how to use this gray-box approach not only to predict the runtime of a given job, but also as part of a decision model for reusing intermediate cached results of Spark jobs. Our methodology is validated by an experimental evaluation, which shows highly accurate predictions of actual job runtimes and a performance improvement when intermediate results can be reused.
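To make the two-step methodology concrete, the following is a minimal, hypothetical sketch of how a white-box cardinality estimate could feed a black-box runtime model. It is not the authors' implementation: the filter-selectivity estimator, the example measurements, and the choice of a gradient-boosted regressor are all illustrative assumptions.

```python
# Hypothetical sketch of a gray-box runtime predictor (illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def estimate_filter_cardinality(input_rows: int, selectivity: float) -> int:
    """White-box step: predict an operator's output cardinality from prior
    knowledge about its behavior (here, an assumed filter selectivity)."""
    return int(input_rows * selectivity)

# Black-box step: train a per-task runtime model on metrics monitored while
# varying allocated resources and input RDD cardinalities.
# Feature columns: [input_cardinality, allocated_cores]; target: runtime (s).
X_train = np.array([[1e6, 2], [1e6, 4], [5e6, 2], [5e6, 4], [1e7, 4], [1e7, 8]])
y_train = np.array([120.0, 70.0, 560.0, 300.0, 610.0, 330.0])  # example data

model = GradientBoostingRegressor().fit(X_train, y_train)

# Gray-box prediction: the white-box cardinality estimate becomes an input
# feature of the black-box model for a planned configuration (4 cores).
cardinality = estimate_filter_cardinality(input_rows=20_000_000, selectivity=0.25)
predicted_runtime = model.predict([[cardinality, 4]])[0]
print(f"predicted task runtime: {predicted_runtime:.1f} s")
```

In the same spirit, such per-task predictions could be compared against the cost of recomputing versus loading a cached intermediate result, which is the role of the decision model mentioned above.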
