Predictive Reliability and Fault Management in Exascale Systems
ACM Computing Surveys (IF 23.8), Pub Date: 2020-09-28, DOI: 10.1145/3403956
Ramon Canal 1, Carles Hernandez 2, Rafa Tornero 2, Alessandro Cilardo 3, Giuseppe Massari 4, Federico Reghenzani 4, William Fornaciari 4, Marina Zapater 5, David Atienza 5, Ariel Oleksiak 6, Wojciech Piątek 6, Jaume Abella 7

Performance and power constraints come together with Complementary Metal Oxide Semiconductor technology scaling in future Exascale systems. Technology scaling makes each individual transistor more prone to faults and, due to the exponential increase in the number of devices per chip, to higher system fault rates. Consequently, High-performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from electronics to system level. We analyze their strengths and limitations. Finally, we identify the promising paths to meet the reliability levels of Exascale systems.

Updated: 2020-09-28