Fault-Detection Managers: More May Not Be the Merrier
Journal of Grid Computing (IF 5.5), Pub Date: 2021-02-20, DOI: 10.1007/s10723-021-09546-2
Ghazal Zamani, Olivia Das

A fault management system contains managers that detect faults and initiate recovery actions. Such systems are often distributed and decoupled from the applications they monitor. Although this arrangement promotes scalability, it makes the recovery of applications dependent on the fault management system itself. This work introduces two novel equations to help meet the performance objectives of applications. The first estimates the maximum number of jobs that an application instance can handle while meeting a given performance objective; an admission control mechanism then uses this formula to restrict the number of jobs (targeted at operational application instances) allowed to enter the system. The second computes the response time distribution of an application. We then develop a simulation model that predicts the impact of failures in four sample fault management architectures on an application's performance. Using our equations, we compare the architectures under three distinct ways of handling affected jobs when application instances fail: allowing job loss; retrying jobs, which results in overload; and employing admission control to mitigate the overload. Our simulation results show that increasing the number of managers may not always be beneficial; rather, the interconnection topology of the management architecture (i.e. the layout of interconnects linking the architectural components), together with the model parameter values, may sometimes play a bigger role in the application's performance.
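
The paper's two equations are not reproduced on this page, so the following is only an illustrative sketch of the general idea of capping admitted jobs to meet a response-time objective. It swaps in a much simpler textbook model than the authors' and should not be read as their method: a single application instance is assumed to serve jobs first-come-first-served with exponentially distributed service times, so a job that is the k-th in the instance has an Erlang-k response time. The names `erlang_cdf`, `max_jobs` and `admit` are hypothetical.

```python
import math


def erlang_cdf(k: int, rate: float, t: float) -> float:
    """P(sum of k i.i.d. Exp(rate) service times <= t), for k >= 1.

    Assumption (not from the paper): exponential service, FCFS order.
    """
    x = rate * t
    term, partial = 1.0, 0.0
    for i in range(k):
        partial += term        # term equals x**i / i!
        term *= x / (i + 1)
    return 1.0 - math.exp(-x) * partial


def max_jobs(rate: float, deadline: float, target: float) -> int:
    """Largest number of jobs an instance may hold (newest job included)
    such that the newest job still finishes within `deadline` with
    probability at least `target`. Returns 0 if even one job is too many.
    """
    if not (rate > 0 and deadline > 0 and 0 < target <= 1):
        raise ValueError("need rate > 0, deadline > 0, 0 < target <= 1")
    n = 0
    # The Erlang CDF falls towards 0 as k grows, so this loop terminates.
    while erlang_cdf(n + 1, rate, deadline) >= target:
        n += 1
    return n


def admit(jobs_in_instance: int, cap: int) -> bool:
    """Admission control: accept a new job only while below the cap."""
    return jobs_in_instance < cap


if __name__ == "__main__":
    # Hypothetical numbers: 10 jobs/s service rate, 1 s deadline, 90% target.
    cap = max_jobs(rate=10.0, deadline=1.0, target=0.9)
    print("cap per instance:", cap)
    print("admit with 3 jobs present:", admit(3, cap))
```

Under these assumptions the cap depends only on the service rate, the deadline and the target probability. A model that accounts for manager failures and the interconnection topology, as the paper's simulation does, would need more than this.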

Updated: 2021-02-21
