Hawkeye: Adaptive Straggler Identification on Heterogeneous Spark Cluster with Reinforcement Learning,IEEE Access


IEEE Access (IF 3.4), Pub Date: 2020-01-01, DOI: 10.1109/access.2020.2982320
Haizhou Du, Shaohua Zhang

It is widely accepted that stragglers severely degrade the performance of big data analytics systems, owing to slow computing nodes, data skew, and similar causes. Stragglers are therefore regarded as a major bottleneck in MapReduce framework processing. However, existing studies on straggler identification target coarse-grained detection, scheduling-level optimization, and offline log-based root-cause analysis. Accurately identifying the stragglers of each job in time remains extremely difficult because (1) stragglers in data analytics frameworks have many root causes; (2) many key parameters affect straggler identification; and (3) cluster configurations differ, and their impact on straggler detection varies with job type and size. Existing solutions either adopt a "tweak-and-pray" manual tuning approach, which is complex, time-consuming, and error-prone, or focus only on coarse-grained straggler detection. In this paper, we systematically explore the fundamental problem of automatic, adaptive straggler identification on big data analytics platforms. Inspired by recent successes in applying Reinforcement Learning (RL) to complex online optimization problems, we investigate whether RL can adaptively select the optimal parameters for identifying stragglers without human intervention. Specifically, we propose Hawkeye, a general adaptive speculative execution system that uses reinforcement learning to identify stragglers and launch speculative tasks on a heterogeneous cluster at runtime. Experimental results show that Hawkeye reduces job completion time across different types of applications. For example, it reduces average job completion time by nearly 37%, with a 23% improvement in identification accuracy over existing solutions on a heterogeneous cluster.
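To make the "key parameters" mentioned in the abstract concrete, here is a minimal sketch of threshold-based straggler detection, loosely modeled on the median-times-multiplier style of speculative execution. It is not Hawkeye's algorithm, which the abstract does not specify; the function name `find_stragglers` and the parameters `multiplier` and `quantile` are assumptions introduced purely for illustration.

```python
from statistics import median


def find_stragglers(completed, running, multiplier=1.5, quantile=0.75):
    """Return the ids of running tasks that look like stragglers.

    completed:  durations (seconds) of tasks already finished in the stage
    running:    dict mapping task id -> elapsed seconds so far
    multiplier: hypothetical knob; how many times slower than the median
                a task must be before it is flagged
    quantile:   hypothetical knob; fraction of tasks that must have
                finished before any task is flagged
    """
    total = len(completed) + len(running)
    # Wait until enough tasks have finished to trust the median.
    if total == 0 or len(completed) < quantile * total:
        return []
    threshold = multiplier * median(completed)
    return sorted(task for task, elapsed in running.items() if elapsed > threshold)


# Illustrative (made-up) durations for one stage.
completed = [10, 11, 9, 10, 12, 10]
running = {"t7": 30, "t8": 12}

strict = find_stragglers(completed, running, multiplier=1.5)
loose = find_stragglers(completed, running, multiplier=1.1)
```

The same stage yields different straggler sets under different `multiplier` values, which is the kind of sensitivity that motivates choosing such parameters adaptively rather than by hand.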


Updated: 2020-01-01
