Handling Data Skew for Aggregation in Spark SQL Using Task Stealing
International Journal of Parallel Programming (IF 0.9), Pub Date: 2020-03-25, DOI: 10.1007/s10766-020-00657-z
Zeyu He, Qiuli Huang, Zhifang Li, Chuliang Weng

In distributed in-memory computing systems, data distribution has a large impact on performance. Designing a good partitioning algorithm is difficult and requires users to have adequate prior knowledge of the data, which makes data skew common in practice. Traditional approaches that handle data skew by sampling and repartitioning often incur additional overhead. In this paper, we propose a dynamic execution optimization for the aggregation operator, one of the most general and expensive operators in Spark SQL. Our optimization aims to avoid this additional overhead and improve performance when data skew occurs. The core idea is task stealing. Based on the relative sizes of data partitions, we add two types of tasks: segment tasks for larger partitions and stealing tasks for smaller partitions. Within a stage, stealing tasks can actively steal and process data from segment tasks after processing their own data. The optimization achieves significant performance improvements, from 16% up to 67%, across different data sizes and distributions. Experiments show that the overhead involved is minimal and negligible.
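The segment-task/stealing-task idea described above can be sketched as a thread-based work-stealing aggregation. This is a minimal illustration under stated assumptions, not the authors' Spark SQL implementation: the `threshold` and `chunk` parameters, the shared queue, and the `Counter`-based partial aggregation are all hypothetical stand-ins for Spark's internal mechanisms.

```python
import threading
from queue import Queue, Empty
from collections import Counter


def aggregate(records):
    # Partial aggregation: count occurrences per key.
    return Counter(records)


def run_with_stealing(partitions, threshold=1000, chunk=256):
    """Illustrative sketch of task stealing for skewed aggregation.

    Partitions larger than `threshold` play the role of segment tasks:
    their data is split into chunks and exposed on a shared queue.
    The remaining (small) partitions play the role of stealing tasks:
    each processes its own data first, then steals chunks until the
    queue is empty.  Parameters are assumptions for this sketch.
    """
    steal_queue = Queue()
    results = []
    results_lock = threading.Lock()

    # "Segment tasks": publish large partitions as stealable chunks.
    for part in partitions:
        if len(part) > threshold:
            for i in range(0, len(part), chunk):
                steal_queue.put(part[i:i + chunk])

    def stealing_task(own_partition):
        partial = aggregate(own_partition)      # process own small partition
        while True:                             # then steal from segment tasks
            try:
                stolen = steal_queue.get_nowait()
            except Empty:
                break
            partial.update(aggregate(stolen))
        with results_lock:
            results.append(partial)

    threads = [threading.Thread(target=stealing_task, args=(p,))
               for p in partitions if len(p) <= threshold]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Final merge of all partial aggregates.
    merged = Counter()
    for partial in results:
        merged.update(partial)
    return merged
```

With one heavily skewed partition and two small ones, the two stealing tasks finish their own data quickly and then drain the skewed partition's chunks concurrently, which is the load-balancing effect the paper targets.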
