Middleware to Manage Fault Tolerance Using Semi-Coordinated Checkpoints
IEEE Transactions on Parallel and Distributed Systems (IF 5.3), Pub Date: 2021-02-01, DOI: 10.1109/tpds.2020.3015615
Alvaro Wong, Elisa Heymann, Dolores Rexachs, Emilio Luque

Compute node failures are becoming a normal event for many long-running, scalable MPI applications. Staying within the MPI standard and building on existing fault-tolerance methods, we developed a methodology that allows applications to tolerate failures by creating semi-coordinated checkpoints within the RADIC architecture. To this end, we developed the ULSC$^{2}$-RADIC middleware, which divides the application into independent MPI worlds, each corresponding to a compute node, and uses the DMTCP checkpoint library in a semi-coordinated environment. We ran experiments with scientific applications and the NAS Parallel Benchmarks to assess the overhead and the functionality in case of a node failure. We also evaluated the computational cost of semi-coordinated checkpoints compared with coordinated checkpoints.
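
To make the per-node partitioning concrete, the following is a minimal Python sketch and not the ULSC$^{2}$-RADIC implementation. It assumes that "semi-coordinated" means processes synchronize for a checkpoint only within their own node's MPI world, which the abstract does not spell out. The function names, the `layout` mapping, and the host names are hypothetical, and the sketch calls neither MPI nor DMTCP.

```python
from collections import defaultdict


def group_ranks_by_node(rank_to_host):
    """Partition global ranks into one group per host.

    Each group is a hypothetical stand-in for one per-node "MPI world".
    """
    groups = defaultdict(list)
    for rank, host in sorted(rank_to_host.items()):
        groups[host].append(rank)
    return dict(groups)


def checkpoint_sync_sets(rank_to_host, semi_coordinated):
    """Return the sets of ranks that must synchronize before a checkpoint.

    Assumption: a coordinated checkpoint synchronizes all ranks at once,
    while a semi-coordinated one synchronizes each node's ranks separately.
    """
    if semi_coordinated:
        return [set(ranks) for ranks in group_ranks_by_node(rank_to_host).values()]
    return [set(rank_to_host)]


# Hypothetical layout: four ranks spread over two compute nodes.
layout = {0: "node-a", 1: "node-a", 2: "node-b", 3: "node-b"}

coordinated = checkpoint_sync_sets(layout, semi_coordinated=False)
semi = checkpoint_sync_sets(layout, semi_coordinated=True)
```

Under this assumption the coordinated case yields a single synchronization set covering every rank, while the semi-coordinated case yields one smaller set per node, so a checkpoint on one node does not have to wait for ranks on another.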

Updated: 2021-02-01