ImRP: A Predictive Partition Method for Data Skew Alleviation in Spark Streaming Environment, Parallel Computing



Parallel Computing (IF 2.0), Pub Date: 2020-10-02, DOI: 10.1016/j.parco.2020.102699
Zhongming Fu, Zhuo Tang, Li Yang, Kenli Li, Keqin Li

Spark Streaming is an extension of the core Spark engine that enables scalable, high-throughput, fault-tolerant processing of live data streams. It treats a stream as a series of deterministic batches and handles each batch as a regular job. However, for the stream job responsible for a batch, data skew (i.e., an imbalance in the amount of data allocated to each reduce task) can degrade job performance significantly because of the resulting load imbalance. In this paper, we propose an improved range partitioner (ImRP) to alleviate reduce skew for stream jobs in Spark Streaming. Unlike previous work, ImRP does not require any pre-run sampling of the input data. Instead, it generates the data partition scheme from the intermediate data distribution estimated from previous batch processing, using an EWMA (Exponentially Weighted Moving Average) prediction model. To lighten the data skew, ImRP introduces a novel method for calculating the partition borders optimally, and a mechanism for splitting the border key clusters when the semantics of the shuffle operators permit. In addition, ImRP takes the integrated partition size and the heterogeneity of the computing environment into account when balancing the load among reduce tasks. We implement ImRP in Spark-3.0 and evaluate its performance on four representative benchmarks: wordCount, sort, pageRank, and LDA. The results show that, by mitigating data skew, ImRP can decrease the execution time of stream jobs substantially compared with other partition strategies, especially when the input batch is severely skewed.
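The abstract does not give ImRP's formulas, so the following Python sketch only illustrates the general idea it describes: predict the per-key load of the next batch with an EWMA over earlier batches, then derive range-partition borders from that prediction. Everything here is an assumption for illustration and not the authors' method: the function names (`ewma_update`, `range_borders`, `partition_of`), the smoothing factor `alpha`, and the greedy equal-share rule for placing borders. Border-cluster splitting and heterogeneity weighting are not modeled.

```python
from bisect import bisect_left

def ewma_update(prev, observed, alpha=0.5):
    """Blend the previous per-key estimate with the counts seen in the latest batch.

    alpha is a hypothetical smoothing factor; keys missing on one side count as 0.
    """
    keys = set(prev) | set(observed)
    return {k: alpha * observed.get(k, 0.0) + (1 - alpha) * prev.get(k, 0.0)
            for k in keys}

def range_borders(estimate, num_partitions):
    """Greedy borders over sorted keys (a simple heuristic, not ImRP's optimal method).

    A border is placed after a key once the cumulative predicted load reaches
    the next multiple of total / num_partitions.
    """
    total = sum(estimate.values())
    borders = []
    running = 0.0
    for key in sorted(estimate):
        running += estimate[key]
        next_target = total * (len(borders) + 1) / num_partitions
        if len(borders) < num_partitions - 1 and running >= next_target:
            borders.append(key)
    return borders

def partition_of(key, borders):
    """Index of the range that holds key; each border is the inclusive upper bound."""
    return bisect_left(borders, key)

# Toy data: key "a" became hot in the latest batch and key "e" is new.
previous_estimate = {"a": 10, "b": 10, "c": 10, "d": 10}
latest_batch = {"a": 30, "b": 10, "c": 10, "d": 10, "e": 20}

estimate = ewma_update(previous_estimate, latest_batch, alpha=0.5)
borders = range_borders(estimate, num_partitions=3)
print(borders)  # ['a', 'c']
```

With these toy numbers the predicted loads are 20 for "a" and 10 for every other key, so the three ranges ("a"), ("b", "c"), and ("d", "e") each carry a predicted load of 20. A plain equal-width split over the five keys would not isolate the hot key in this way.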


Updated: 2020-10-02
