Optimal equidistant checkpointing of fault tolerant systems subject to correlated failure
Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability (IF 2.1), Pub Date: 2020-05-04, DOI: 10.1177/1748006x19893569
Bentolhoda Jafary, Lance Fiondella, Ping-Chen Chang

Checkpointing is a technique for backing up work at periodic intervals so that, if computation fails, it need not restart from the beginning and can instead resume from the latest checkpoint. Performing checkpointing operations takes time, so there is a tradeoff between the time spent on checkpointing and the time saved when computation restarts at a checkpoint. This article presents a method to model the impact of correlated failures on an application that performs a specified amount of computation and checkpoints at equidistant intervals during this computation. We develop a Markov model and superimpose a correlated life distribution. Two cases are considered. The first assumes that reaching a checkpoint resets the failure distribution. The second allows the probability of failure to continue to progress across checkpoints. We illustrate the approach through a series of examples. The results indicate that correlation can negatively impact checkpointing, necessitating more frequent checkpointing and increasing the total time required, but that the approach can still identify the optimal number of equidistant checkpoints despite this correlation.
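The tradeoff described in the abstract can be illustrated with a much simpler model than the one in the paper. The sketch below is not the authors' Markov model and ignores correlation entirely. It assumes independent exponentially distributed failures with a constant rate, instantaneous restart from the last checkpoint, and a fixed cost per checkpoint. Under those assumptions the expected time to complete a failure-free run of length s is (e^(rate·s) − 1) / rate. All function and parameter names are hypothetical and chosen only for illustration.

```python
import math


def expected_segment_time(length, rate):
    """Expected time to complete `length` units of uninterrupted work.

    Assumption: failures are exponential with the given rate, and a failure
    restarts the segment immediately (zero recovery time).
    """
    return math.expm1(rate * length) / rate


def expected_total_time(work, n_segments, checkpoint_cost, rate):
    """Expected completion time when `work` is split into equal segments.

    Assumption: every segment ends with a checkpoint of fixed cost, and a
    failure during the checkpoint also forces the segment to be redone.
    """
    segment = work / n_segments + checkpoint_cost
    return n_segments * expected_segment_time(segment, rate)


def best_segment_count(work, checkpoint_cost, rate, max_segments=200):
    """Brute-force search for the segment count with the lowest expected time."""
    return min(
        range(1, max_segments + 1),
        key=lambda n: expected_total_time(work, n, checkpoint_cost, rate),
    )


if __name__ == "__main__":
    # Illustrative parameters only; they are not taken from the paper.
    work, cost, rate = 100.0, 1.0, 0.05
    n = best_segment_count(work, cost, rate)
    print(n, expected_total_time(work, n, cost, rate))
```

With too few segments the expected rework after a failure dominates, and with too many the checkpoint overhead dominates, so the expected time has an interior minimum. The paper's contribution is to locate that minimum when failures are correlated, which this sketch does not attempt.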




Updated: 2020-05-04