Abstract cost models for distributed data-intensive computations
Distributed and Parallel Databases ( IF 1.2 ) Pub Date : 2018-08-24 , DOI: 10.1007/s10619-018-7244-2
Rundong Li 1 , Ningfang Mi 2 , Mirek Riedewald 1 , Yizhou Sun 3 , Yi Yao 2

We consider data analytics workloads on distributed architectures, in particular clusters of commodity machines. To find a job partitioning that minimizes running time, a cost model, which we more accurately refer to as a makespan model, is needed. In search of the simplest such model that is still sufficiently accurate, we explore piecewise linear functions of input, output, and computational complexity. These functions are abstract in the sense that they capture fundamental algorithm properties but do not require explicit modeling of system and implementation details such as the number of disk accesses. We show how the simplified functional structure can be exploited to reduce optimization cost. In the general case, we identify a lower bound that can be used for search-space pruning. For applications with homogeneous tasks, we further demonstrate how to integrate the model directly into the makespan optimization process, reducing search-space dimensionality, and thus complexity, by orders of magnitude. Experimental results provide evidence of good prediction quality and successful makespan optimization across a variety of operators and cluster architectures.
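
To make the idea of a piecewise linear makespan model more concrete, the following is a minimal illustrative sketch in Python. It is not the authors' model: the coefficients, the single breakpoint on input size, the wave-based scheduling, and the function names `task_time`, `makespan`, and `best_task_count` are all assumptions made for illustration. It also assumes that input, output, and computation divide evenly across homogeneous tasks, which will not hold for every operator.

```python
import math

# Hypothetical coefficients (not from the paper):
# (fixed overhead, cost per unit input, cost per unit output, cost per unit computation).
# Two linear pieces, switched at an assumed input-size breakpoint,
# e.g. the point where a task's data no longer fits in memory.
BREAKPOINT = 1000.0
SMALL = (1.0, 0.010, 0.020, 0.001)
LARGE = (-9.0, 0.020, 0.020, 0.001)  # chosen so both pieces meet at the breakpoint


def task_time(inp, out, comp):
    """Piecewise linear estimate of the running time of a single task."""
    b0, b_in, b_out, b_comp = SMALL if inp <= BREAKPOINT else LARGE
    return b0 + b_in * inp + b_out * out + b_comp * comp


def makespan(total_in, total_out, total_comp, n_tasks, n_workers):
    """Homogeneous tasks: equal shares per task, executed in waves of n_workers."""
    waves = math.ceil(n_tasks / n_workers)
    per_task = task_time(total_in / n_tasks, total_out / n_tasks, total_comp / n_tasks)
    return waves * per_task


def best_task_count(total_in, total_out, total_comp, n_workers, max_tasks):
    """Exhaustive one-dimensional search over the number of tasks."""
    return min(
        range(1, max_tasks + 1),
        key=lambda n: makespan(total_in, total_out, total_comp, n, n_workers),
    )


if __name__ == "__main__":
    n = best_task_count(8000.0, 0.0, 0.0, n_workers=4, max_tasks=16)
    print(n, makespan(8000.0, 0.0, 0.0, n, 4))
```

Because all tasks are assumed identical, the search runs over one number (the task count) instead of over individual task sizes. This is a toy version of the dimensionality reduction the abstract describes for homogeneous tasks.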

Updated: 2018-08-24