Soft errors detection and automatic recovery based on replication combined with different levels of checkpointing
Future Generation Computer Systems (IF 7.5), Pub Date: 2020-07-07, DOI: 10.1016/j.future.2020.07.003
Diego Montezanti, Enzo Rucci, Armando De Giusti, Marcelo Naiouf, Dolores Rexachs, Emilio Luque

Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent undetected errors will occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, which is a methodology that improves system reliability against transient faults when running parallel message-passing applications. Our approach, based on process replication for detection, combined with different levels of checkpointing for automatic recovery, has the goal of helping users of scientific applications to obtain executions with correct results. SEDAR is structured in three levels: (1) only detection and safe-stop with notification; (2) recovery based on multiple system-level checkpoints; and (3) recovery based on a single valid user-level checkpoint. As each of these variants supplies a particular coverage but involves limitations and implementation costs, SEDAR can be adapted to the needs of the system. In this work, a description of the methodology is presented and the temporal behavior of employing each SEDAR strategy is mathematically described, both in the absence and presence of faults. A model that considers all the fault scenarios on a test application is introduced to show the validity of the detection and recovery mechanisms. An overhead evaluation of each variant is performed with applications involving different communication patterns; this is also used to extract guidelines about when it is beneficial to employ each SEDAR protection level. As a result, we show its efficacy and viability to tolerate transient faults in target HPC environments.
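
The general idea of detection by replication combined with checkpoint-based recovery can be illustrated with a small sketch. This is not SEDAR's implementation, which targets parallel message-passing applications. It is a single-process toy under stated assumptions: the application is a deterministic `step` function over a list of integers, a transient fault is simulated by flipping one bit in one replica, and a checkpoint is stored only after both replicas agree. All names here (`run_protected`, `flip_bit`, `accumulate`, `SilentErrorDetected`, `checkpoint_every`) are hypothetical.

```python
import copy


class SilentErrorDetected(Exception):
    """Raised in detection-only mode when the two replicas disagree."""


def flip_bit(values, index=0, bit=3):
    """Simulate a transient fault: flip one bit of one element (toy fault model)."""
    corrupted = list(values)
    corrupted[index] ^= 1 << bit
    return corrupted


def run_protected(step, state, n_steps, recover=True, fault_at=None,
                  checkpoint_every=2):
    """Run `step` redundantly on two replicas and compare after every step.

    recover=False: stop at the first mismatch and notify (detection only).
    recover=True:  roll both replicas back to the last validated checkpoint.
    fault_at:      step index at which a one-time fault hits replica B.
    Returns (final_state, number_of_rollbacks).
    """
    replica_a = copy.deepcopy(state)
    replica_b = copy.deepcopy(state)
    checkpoint = (0, copy.deepcopy(state))  # (step index, validated state)
    rollbacks = 0
    i = 0
    while i < n_steps:
        replica_a = step(replica_a)
        replica_b = step(replica_b)
        if fault_at == i:
            replica_b = flip_bit(replica_b)
            fault_at = None  # transient: the fault does not recur on re-execution
        if replica_a != replica_b:
            if not recover:
                raise SilentErrorDetected(f"replicas diverged at step {i}")
            # Restore both replicas from the checkpoint and re-execute.
            i, saved = checkpoint
            replica_a = copy.deepcopy(saved)
            replica_b = copy.deepcopy(saved)
            rollbacks += 1
            continue
        i += 1
        if i % checkpoint_every == 0:
            # Both replicas agree here, so this state is treated as valid.
            checkpoint = (i, copy.deepcopy(replica_a))
    return replica_a, rollbacks


def accumulate(values):
    """Toy application kernel: add 1, 2, 3, ... to successive elements."""
    return [v + k for k, v in enumerate(values, start=1)]


if __name__ == "__main__":
    print(run_protected(accumulate, [0, 0, 0], 5))              # fault-free run
    print(run_protected(accumulate, [0, 0, 0], 5, fault_at=2))  # one rollback
```

With `recover=False` the sketch loosely corresponds to a detect-and-stop policy, and with `recover=True` to rollback to a checkpoint known to be valid. The multi-checkpoint variant mentioned in the abstract, and the cost of comparing replicas in a distributed setting, are outside the scope of this toy.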




Updated: 2020-07-07