User-level failure detection and auto-recovery of parallel programs in HPC systems
Frontiers of Computer Science (IF 4.2), Pub Date: 2021-09-01, DOI: 10.1007/s11704-020-0190-y
Guozhen Zhang 1, 2, 3 , Hailong Yang 1, 2, 3 , Yi Liu 2, 3 , Depei Qian 2, 3 , Jun Xu 4

As the mean time between failures (MTBF) continues to decline with the growing number of components in large-scale high performance computing (HPC) systems, program failures are increasingly likely to occur during execution. Ensuring that HPC programs run to completion has therefore become an issue that unprivileged users must be concerned with. From the user's perspective, a program failure that is not detected and handled in time wastes resources and delays the progress of the computation. Unfortunately, unprivileged users cannot check program state themselves, both because the job management system controls execution and because their privileges are limited. Automated tools that support user-level failure detection and auto-recovery of parallel programs in HPC systems are currently missing. This paper proposes a method that lets an unprivileged user detect failures in job execution and automatically resubmit failed jobs. The state checker in the method is encapsulated as an independent job to reduce interference with the user's jobs. In addition, a dual-checker mechanism is proposed to improve the robustness of the approach. The method is implemented as a tool named automatic re-launcher (ARL) and evaluated on the Tianhe-2 system. Experimental results show that ARL detects execution failures effectively on Tianhe-2 and that the communication and performance overhead it introduces is negligible. Its good scalability makes ARL applicable to large-scale HPC systems.
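The abstract gives only the outline of the mechanism, so the following is a minimal sketch of one plausible reading, not ARL's actual design. It assumes two checkers that each run as their own job, watch the user job and each other, and resubmit whatever has failed. `FakeScheduler`, `check_round`, the role names, and the three job states are all hypothetical stand-ins introduced for illustration; a real job management system would be queried through its own user-level commands.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical job states; a real job management system reports its own set.
RUNNING, COMPLETED, FAILED = "RUNNING", "COMPLETED", "FAILED"


@dataclass
class FakeScheduler:
    """In-memory stand-in for a job management system (illustration only)."""
    states: Dict[int, str] = field(default_factory=dict)
    next_id: int = 1

    def submit(self, name: str) -> int:
        """Register a new job and return its id; the name is unused here."""
        job_id = self.next_id
        self.next_id += 1
        self.states[job_id] = RUNNING
        return job_id

    def state(self, job_id: int) -> str:
        # Assumption: a job the scheduler no longer knows about counts as failed.
        return self.states.get(job_id, FAILED)


def check_round(sched: FakeScheduler, jobs: Dict[str, int]) -> List[str]:
    """Run one polling round of the two checkers.

    `jobs` maps a role ("user", "checker_a", "checker_b") to its current
    job id and is updated in place. Returns the roles that were resubmitted.
    """
    resubmitted: List[str] = []
    for me, peer in (("checker_a", "checker_b"), ("checker_b", "checker_a")):
        if sched.state(jobs[me]) != RUNNING:
            continue  # a checker that is not running cannot check anything
        for role in ("user", peer):
            if sched.state(jobs[role]) == FAILED:
                jobs[role] = sched.submit(role)  # resubmit as a fresh job
                resubmitted.append(role)
    return resubmitted


if __name__ == "__main__":
    sched = FakeScheduler()
    jobs = {role: sched.submit(role) for role in ("user", "checker_a", "checker_b")}
    sched.states[jobs["user"]] = FAILED  # simulate a failure of the user job
    print(check_round(sched, jobs), jobs)
```

In this sketch a single checker failure is repaired by the surviving checker, which is the robustness argument for having two. If both checkers fail in the same round, nothing is resubmitted; how ARL handles that case is not stated in the abstract.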




Updated: 2021-09-02