Data balancing-based intermediate data partitioning and check point-based cache recovery in Spark environment
The Journal of Supercomputing (IF 2.5), Pub Date: 2021-08-02, DOI: 10.1007/s11227-021-04000-2
Chunlin Li 1,2, Qianqian Cai 2, Youlong Luo 2

Data shuffling and cache recovery are both essential parts of the Spark system and directly affect Spark's parallel computing performance. Existing dynamic partitioning schemes for the data-skew problem in the shuffle phase suffer from poor dynamic adaptability and insufficient granularity. To address these problems, this paper proposes a dynamic balanced partitioning method for the shuffle phase based on reservoir sampling. The method mitigates the impact of data skew on Spark performance by sampling and preprocessing intermediate data, predicting the overall data skew, and producing the partitioning strategy that the application then executes. In addition, an ill-chosen failure recovery strategy increases recovery overhead and makes the data recovery mechanism inefficient. To address this issue, the paper proposes a checkpoint-based fast recovery strategy for the RDD cache. The strategy analyzes the task execution mechanism of the in-memory computing framework, obtains detailed task information through semantic analysis of the application code, and combines a failure recovery model with weight information to form a new failure recovery strategy, thereby improving the efficiency of data recovery. Experimental results show that the proposed dynamic balanced partitioning approach effectively reduces the total completion time of applications and improves Spark's parallel computing performance, and that the proposed fast cache recovery strategy effectively increases the speed of data recovery and the overall computation rate of Spark.
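The core idea of the partitioning method — sample intermediate keys with a reservoir, estimate the key-frequency skew, then assign keys to partitions so that loads stay balanced — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, sample size, and the greedy longest-processing-time assignment heuristic are assumptions, since the abstract does not specify the exact partitioning algorithm.

```python
import random
from collections import Counter

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: keep a uniform random sample of k keys from a stream
    of unknown length, using O(k) memory."""
    rng = random.Random(seed)
    sample = []
    for i, key in enumerate(stream):
        if i < k:
            sample.append(key)
        else:
            # Replace a reservoir slot with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                sample[j] = key
    return sample

def skew_aware_partition(sample, num_partitions):
    """Greedy assignment (illustrative): estimate each key's frequency from
    the sample, then place the heaviest remaining key on the currently
    lightest partition, so partition loads stay roughly balanced."""
    freq = Counter(sample)
    loads = [0] * num_partitions
    assignment = {}
    for key, count in freq.most_common():
        target = loads.index(min(loads))  # lightest partition so far
        assignment[key] = target
        loads[target] += count
    return assignment, loads

# Usage: a skewed key stream where key "a" dominates
stream = ["a"] * 900 + ["b"] * 50 + ["c"] * 30 + ["d"] * 20
sample = reservoir_sample(stream, k=200)
assignment, loads = skew_aware_partition(sample, num_partitions=4)
```

A default hash partitioner would send all records of the dominant key "a" to one partition; the sketch above instead spreads the estimated load so no single partition holds much more than its share, which is the effect the paper's balanced partitioning aims at.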




Updated: 2021-08-03