A Dynamic and Failure-aware Task Scheduling Framework for Hadoop
IEEE Transactions on Cloud Computing (IF 5.3), Pub Date: 2020-04-01, DOI: 10.1109/tcc.2018.2805812
Mbarka Soualhia, Foutse Khomh, Sofiene Tahar

Hadoop has become a popular framework for processing data-intensive applications in cloud environments. A core constituent of Hadoop is the scheduler, which is responsible for scheduling and monitoring jobs and tasks, and for rescheduling them in case of failures. Although fault-tolerance mechanisms have been proposed for Hadoop, its performance can still be significantly impacted by unforeseen events in the cloud environment. In this paper, we introduce a dynamic and failure-aware framework that can be integrated within the Hadoop scheduler and adjusts scheduling decisions based on information collected about the cloud environment. Our framework relies on predictions made by machine learning algorithms and on scheduling policies generated by a Markov Decision Process (MDP) to adjust its scheduling decisions on the fly. Instead of the fixed heartbeat-based failure detection commonly used in Hadoop to track active TaskTrackers (i.e., the nodes that process the scheduled tasks), our proposed framework implements an adaptive algorithm that can dynamically detect TaskTracker failures. To deploy our proposed framework, we have built ATLAS+, an AdapTive Failure-Aware Scheduler for Hadoop. To assess the performance of ATLAS+, we conduct a large empirical study on a 100-node Hadoop cluster deployed on Amazon Elastic MapReduce (EMR), comparing the performance of ATLAS+ with that of three Hadoop schedulers (FIFO, Fair, and Capacity). Results show that ATLAS+ outperforms the FIFO, Fair, and Capacity schedulers: it can reduce the number of failed jobs by up to 43 percent and the number of failed tasks by up to 59 percent. On average, ATLAS+ reduces the total execution time of jobs by 10 minutes (about 40 percent of the job execution time) and of tasks by up to 3 minutes (about 47 percent of the task execution time). ATLAS+ also reduces CPU and memory usage by 22 and 20 percent, respectively.
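The abstract does not describe the adaptive failure-detection algorithm in detail, but the core idea of replacing Hadoop's fixed heartbeat expiry with a per-node, statistics-driven timeout can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes a timeout derived from the mean and standard deviation of recent heartbeat inter-arrival times, and all names and parameters (AdaptiveFailureDetector, window, safety_factor, floor_s) are hypothetical.

```python
import time
from collections import deque

class AdaptiveFailureDetector:
    """Illustrative adaptive TaskTracker failure detector.

    Hadoop's default scheduler declares a TaskTracker dead after a
    fixed heartbeat timeout. This sketch instead derives a per-node
    timeout from the observed heartbeat inter-arrival times, so slow
    but healthy nodes are not declared failed prematurely and crashed
    nodes can be detected sooner. All constants are illustrative.
    """

    def __init__(self, window=50, safety_factor=3.0, floor_s=3.0):
        self.window = window                # heartbeats kept per tracker
        self.safety_factor = safety_factor  # timeout = mean + factor * stddev
        self.floor_s = floor_s              # never time out faster than this
        self.history = {}                   # tracker_id -> deque of intervals
        self.last_seen = {}                 # tracker_id -> last heartbeat time

    def record_heartbeat(self, tracker_id, now=None):
        """Record a heartbeat and update the inter-arrival statistics."""
        now = time.time() if now is None else now
        if tracker_id in self.last_seen:
            interval = now - self.last_seen[tracker_id]
            self.history.setdefault(
                tracker_id, deque(maxlen=self.window)).append(interval)
        self.last_seen[tracker_id] = now

    def _timeout(self, tracker_id):
        """Adaptive timeout for one tracker, based on its recent intervals."""
        intervals = self.history.get(tracker_id)
        if not intervals:
            return self.floor_s * 10        # no statistics yet: be conservative
        mean = sum(intervals) / len(intervals)
        var = sum((x - mean) ** 2 for x in intervals) / len(intervals)
        return max(self.floor_s, mean + self.safety_factor * var ** 0.5)

    def suspected_failures(self, now=None):
        """Return trackers whose silence exceeds their adaptive timeout."""
        now = time.time() if now is None else now
        return [t for t, seen in self.last_seen.items()
                if now - seen > self._timeout(t)]
```

A scheduler-side monitor would call record_heartbeat() on every TaskTracker heartbeat and periodically call suspected_failures() to decide which nodes' tasks should be rescheduled; the actual ATLAS+ detector may use a different statistical model combined with its machine-learning failure predictions.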

Updated: 2020-04-01