Jargon of Hadoop MapReduce scheduling techniques: a scientific categorization, The Knowledge Engineering Review



Jargon of Hadoop MapReduce scheduling techniques: a scientific categorization
The Knowledge Engineering Review (IF 2.8) Pub Date: 2020-09-02, DOI: 10.1017/s0269888918000371
Muhammad Hanif, Choonhwa Lee

Recently, the valuable knowledge that can be extracted from huge volumes of data (called Big Data) has set in motion the development of frameworks that process data using parallel and distributed computing, including Apache Hadoop, Facebook Corona, and Microsoft Dryad. Apache Hadoop is an open source implementation of Google MapReduce that has attracted strong attention from the research community in both academia and industry. Hadoop MapReduce scheduling algorithms play a critical role in the management of large commodity clusters, controlling QoS requirements by supervising users, jobs, and task execution. Hadoop MapReduce comprises three schedulers: FIFO, Fair, and Capacity. However, the research community has developed new optimizations to account for advances and dynamic changes in hardware and operating environments. Numerous efforts in the literature address network congestion, straggling, data locality, heterogeneity, resource under-utilization, and skew mitigation in Hadoop scheduling. The volume of research published in journals and conferences about Hadoop scheduling has increased consistently, which makes it difficult for researchers to grasp the overall state of the research and the areas that require further investigation. This study conducts a scientific literature review to assess preceding research contributions to the Apache Hadoop scheduling mechanism. We classify and quantify the main issues addressed in the literature based on their jargon and the areas they address. Moreover, we explain and discuss the various challenges and open issues in Hadoop scheduling optimizations.


Updated: 2020-09-02
