Legio: Fault Resiliency for Embarrassingly Parallel MPI Applications
arXiv - CS - Performance, Pub Date: 2021-04-29, DOI: arxiv-2104.14246
Roberto Rocco, Davide Gadioli, Gianluca Palermo

As HPC machines grow in size, faults are becoming an eventuality that applications must face. Natively, MPI provides no support for continuing execution after a fault is detected, and this limitation is becoming increasingly constraining. The introduction of ULFM (User Level Failure Mitigation) gave MPI a possible way to survive a fault during application execution, at the cost of code modifications. However, ULFM is intrusive in the application and also requires a deep understanding of its recovery procedures. In this paper we propose Legio, a framework that lowers the complexity of introducing resiliency into an embarrassingly parallel MPI application. By hiding ULFM behind the MPI calls, the library exposes resiliency features to the application transparently, thus removing any integration effort. Upon a fault, the failed nodes are discarded and execution continues with only the non-failed ones. A hierarchical implementation of the solution is also proposed to reduce the overhead of the repair process when scaling to a large number of nodes. We evaluated our solutions on the Marconi100 cluster at CINECA, showing that the overhead introduced by the library is negligible and does not limit the scalability of MPI. Moreover, we integrated the solution into real-world applications and injected faults to further demonstrate its robustness.
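The "discard the failed nodes and continue with the survivors" idea can be illustrated with a toy simulation. The sketch below is not Legio's code and does not call MPI or ULFM; `ToyComm`, `ResilientComm`, and `RankFailed` are hypothetical names invented for illustration, and the communicator is modeled as a plain list of rank ids. It only shows the general pattern of a wrapper that catches a failure inside a collective, shrinks the communicator, and retries, so that the caller never handles the fault itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


class RankFailed(Exception):
    """Raised by the toy communicator when a participating rank is dead."""


@dataclass
class ToyComm:
    """Hypothetical stand-in for a communicator: a list of rank ids."""
    ranks: List[int]
    failed: Set[int] = field(default_factory=set)

    def allreduce_sum(self, values: Dict[int, int]) -> int:
        # A collective cannot complete if any member rank has failed.
        dead = [r for r in self.ranks if r in self.failed]
        if dead:
            raise RankFailed(dead)
        return sum(values[r] for r in self.ranks)

    def shrink(self) -> "ToyComm":
        # Build a new communicator that contains only the survivors.
        # The failure set is shared so later failures remain visible.
        survivors = [r for r in self.ranks if r not in self.failed]
        return ToyComm(survivors, self.failed)


class ResilientComm:
    """Wrapper that repairs the communicator transparently to the caller."""

    def __init__(self, comm: ToyComm) -> None:
        self.comm = comm

    def allreduce_sum(self, values: Dict[int, int]) -> int:
        while True:
            try:
                return self.comm.allreduce_sum(values)
            except RankFailed:
                # Drop the failed ranks, then retry with the survivors.
                self.comm = self.comm.shrink()


if __name__ == "__main__":
    comm = ResilientComm(ToyComm(ranks=[0, 1, 2, 3]))
    values = {rank: rank + 1 for rank in range(4)}
    print(comm.allreduce_sum(values))   # all four ranks contribute
    comm.comm.failed.add(2)             # inject a fault on rank 2
    print(comm.allreduce_sum(values))   # result computed by survivors only
    print(comm.comm.ranks)
```

The result after the injected fault omits the contribution of the failed rank, which is acceptable only because the workload is assumed to be embarrassingly parallel: each rank's work is independent, so losing one rank loses only that rank's share.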

Updated: 2021-04-30