Failure Analysis of Hadoop Schedulers using an Integration of Model Checking and Simulation
arXiv - CS - Performance. Pub Date: 2021-09-07, DOI: arxiv-2109.04196
Mbarka Soualhia, Foutse Khomh, Sofiene Tahar

The Hadoop scheduler is a centerpiece of Hadoop, the leading processing framework for data-intensive applications in the cloud. Given the impact of failures on the performance of applications running on Hadoop, testing and verifying the performance of the Hadoop scheduler is critical. Existing approaches such as performance simulation and analytical modeling are inadequate because they cannot provide a complete verification of a Hadoop scheduler, owing to the wide range of constraints and aspects involved in Hadoop. In this paper, we propose a novel methodology that combines simulation and model checking techniques to perform a formal verification of Hadoop schedulers, focusing on three properties: schedulability, fairness, and resource-deadlock freeness. We use the CSP language to formally describe a Hadoop scheduler, and the PAT model checker to verify its properties. Next, we use the proposed formal model to analyze the scheduler of OpenCloud, a Hadoop-based cluster that simulates the Hadoop load, in order to illustrate the usability and benefits of our work. Results show that our proposed methodology can help identify task failures (up to 78%) early on, i.e., before the tasks are executed on the cluster.
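To make the resource-deadlock freeness property concrete, the sketch below performs a tiny explicit-state search in Python. It is not the authors' CSP model and does not use PAT; the task model, the resource names (`map_slot`, `reduce_slot`), and the functions `successors` and `find_deadlocks` are all hypothetical, chosen only to show the kind of question a model checker answers: is there a reachable state in which unfinished tasks can no longer make progress?

```python
from collections import deque

def successors(state, orders):
    """Yield the states reachable in one step (toy model, not the paper's).

    state[i] is how many resources task i has acquired so far;
    len(orders[i]) means it holds all of them, and len(orders[i]) + 1
    means it has finished and released everything.
    """
    held = {}
    for i, progress in enumerate(state):
        if progress <= len(orders[i]):
            for resource in orders[i][:progress]:
                held[resource] = i
    for i, progress in enumerate(state):
        order = orders[i]
        if progress < len(order):
            # The task may acquire its next resource only if it is free.
            if order[progress] not in held:
                yield state[:i] + (progress + 1,) + state[i + 1:]
        elif progress == len(order):
            # The task holds everything: it runs and releases all resources.
            yield state[:i] + (progress + 1,) + state[i + 1:]

def find_deadlocks(orders):
    """Breadth-first search over all reachable states.

    Returns every reachable state that has no successor although
    at least one task has not finished.
    """
    start = tuple(0 for _ in orders)
    done = tuple(len(order) + 1 for order in orders)
    seen = {start}
    queue = deque([start])
    deadlocks = []
    while queue:
        state = queue.popleft()
        next_states = list(successors(state, orders))
        if not next_states and state != done:
            deadlocks.append(state)
        for nxt in next_states:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return deadlocks

# Two tasks acquiring the same two resources in opposite order can deadlock.
opposite = [("map_slot", "reduce_slot"), ("reduce_slot", "map_slot")]
# Acquiring them in the same order cannot.
same = [("map_slot", "reduce_slot"), ("map_slot", "reduce_slot")]
print(find_deadlocks(opposite))
print(find_deadlocks(same))
```

The search is exhaustive over this toy state space, which is what separates it from simulating a handful of runs: a single simulated execution could easily miss the interleaving in which each task holds one resource and waits for the other.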

Updated: 2021-09-10