A Data Skew Oriented Reduce Placement Algorithm Based on Sampling
IEEE Transactions on Cloud Computing ( IF 5.3 ) Pub Date : 2020-10-01 , DOI: 10.1109/tcc.2016.2607738
Zhuo Tang , Wen Ma , Kenli Li , Keqin Li

Owing to frequent disk I/O and large data transfers among different racks and physical nodes, intermediate data communication has become the most significant performance bottleneck in most running Hadoop systems. This paper proposes a reduce placement algorithm called CORP that schedules related map and reduce tasks on nearby nodes of clusters or racks for data locality. Because the number of keys cannot be counted until the input data are processed by map tasks, this paper applies a reservoir algorithm to sample the input data, which brings the distribution of keys/values closer to that of the original data as a whole. Based on the distribution matrix of the intermediate results in each partition, and by calculating the distance and cost matrices of cross-node communication, the related map and reduce tasks can be scheduled to relatively nearby physical nodes for data locality. We implement CORP in Hadoop 2.4.0 and evaluate its performance using three widely used benchmarks: Sort, Grep, and Join. In these experiments, an evaluation model is proposed for selecting appropriate sample rates, which comprehensively considers the importance of cost, effect, and variance in sampling. Experimental results show that CORP not only improves the balance of reduce tasks effectively but also decreases job execution time owing to lower internal data communication. Compared with some other reduce scheduling algorithms, the average data transmission of the entire system on the core switch is reduced substantially.
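The reservoir sampling step the abstract describes can be sketched as follows. This is a minimal illustration of the general technique (Algorithm R) applied to estimating the key distribution of skewed map input, not the authors' implementation; the helper names and the synthetic skewed input are assumptions for the example.

```python
import random
from collections import Counter

def reservoir_sample(records, k, seed=None):
    """Algorithm R: keep a uniform sample of k records from a stream
    whose total length is not known in advance."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            # Record i survives with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive upper bound
            if j < k:
                reservoir[j] = rec
    return reservoir

def estimate_key_distribution(records, k, seed=None):
    """Approximate the relative frequency of each key from the sample,
    standing in for the distribution matrix the placement step consumes."""
    sample = reservoir_sample(records, k, seed)
    counts = Counter(key for key, _ in sample)
    total = sum(counts.values())
    return {key: c / total for key, c in counts.items()}

# Synthetic skewed input: key "a" dominates, as in a data-skew scenario.
records = [("a", 1)] * 8000 + [("b", 1)] * 1500 + [("c", 1)] * 500
dist = estimate_key_distribution(records, k=1000, seed=42)
```

With a sample of 1,000 records out of 10,000, the estimated shares land close to the true 80/15/5 split, which is the property CORP relies on: the sample's key distribution tracks the overall input without a full map pass.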

Updated: 2020-10-01