Optimal Checkpointing Strategies for Iterative Applications, IEEE Transactions on Parallel and Distributed Systems



IEEE Transactions on Parallel and Distributed Systems (IF 5.6), Pub Date: 2021-07-26, DOI: 10.1109/tpds.2021.3099440
Yishu Du, Loris Marchal, Guillaume Pallez, Yves Robert

This work provides an optimal checkpointing strategy to protect iterative applications from fail-stop errors. We consider a general framework, where the application repeats the same execution pattern by executing consecutive iterations, and where each iteration is composed of several tasks. These tasks have different execution lengths and different checkpoint costs. Assume that there are $n$ tasks and that task $a_{i}$, where $0 \leq i < n$, has execution time $t_{i}$ and checkpoint cost $c_{i}$. A naive strategy would checkpoint after each task. Another naive strategy would checkpoint at the end of each iteration. A strategy inspired by the Young/Daly formula would work for $\sqrt{2 \mu c_{\text{ave}}}$ seconds, where $\mu$ is the application MTBF and $c_{\text{ave}}$ is the average checkpoint time, then checkpoint at the end of the current task, and repeat. Another strategy, also inspired by the Young/Daly formula, would select the task $a_{\min}$ with the smallest checkpoint cost $c_{\min}$ and would checkpoint after every $p^{\text{th}}$ instance of that task, leading to a checkpointing period $p T$, where $T = \sum_{i=0}^{n-1} t_{i}$ is the time per iteration. One would choose the period so that $p T \approx \sqrt{2 \mu c_{\min}}$ to obey the Young/Daly formula. All these naive and Young/Daly strategies are suboptimal. Our main contribution is to show that the optimal checkpoint strategy is globally periodic, and to design a dynamic programming algorithm that computes the optimal checkpointing pattern. This pattern may well checkpoint many different tasks, across many different iterations. We show through simulations, both from synthetic and real-life application scenarios, that the optimal strategy outperforms the naive and Young/Daly strategies.
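The two Young/Daly-inspired baselines described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's dynamic programming algorithm, and it does not reproduce the authors' code. The function names, the rounding of $p$ to the nearest positive integer, and the choice to count only work time (not checkpoint time) toward the $\sqrt{2 \mu c_{\text{ave}}}$ threshold are all assumptions made for illustration.

```python
import math


def young_daly_period(mtbf, ckpt_cost):
    """Young/Daly first-order period sqrt(2 * mu * c)."""
    return math.sqrt(2.0 * mtbf * ckpt_cost)


def average_cost_strategy(t, c, mtbf, num_checkpoints):
    """Baseline 1 (hypothetical helper): work for sqrt(2*mu*c_ave) seconds,
    then checkpoint at the end of the current task, and repeat.

    Returns the first `num_checkpoints` checkpoint positions as
    (iteration index, task index) pairs.
    Assumption: only task execution time counts toward the threshold.
    """
    n = len(t)
    threshold = young_daly_period(mtbf, sum(c) / n)
    positions = []
    worked = 0.0
    k = 0  # global task counter across iterations
    while len(positions) < num_checkpoints:
        worked += t[k % n]
        if worked >= threshold:
            positions.append((k // n, k % n))
            worked = 0.0
        k += 1
    return positions


def min_cost_strategy(t, c, mtbf):
    """Baseline 2 (hypothetical helper): checkpoint after every p-th instance
    of the task with the smallest checkpoint cost, with p*T close to
    sqrt(2*mu*c_min).

    Returns (index of a_min, p).
    Assumption: p is rounded to the nearest integer and is at least 1.
    """
    T = sum(t)  # time per iteration
    i_min = min(range(len(c)), key=lambda i: c[i])
    p = max(1, round(young_daly_period(mtbf, c[i_min]) / T))
    return i_min, p


if __name__ == "__main__":
    # Made-up values, in seconds, chosen only to exercise the functions.
    t = [10.0, 20.0, 30.0]   # execution times t_i
    c = [4.0, 1.0, 4.0]      # checkpoint costs c_i
    mu = 5000.0              # application MTBF
    print(average_cost_strategy(t, c, mu, 2))
    print(min_cost_strategy(t, c, mu))
```

Both baselines fix a single rule in advance, whereas the abstract states that the optimal pattern may checkpoint different tasks in different iterations, which neither helper can express.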


Updated: 2021-07-26
