Efficient detection of silent data corruption in HPC applications with synchronization-free message verification
The Journal of Supercomputing (IF 2.5), Pub Date: 2021-06-09, DOI: 10.1007/s11227-021-03892-4
Guozhen Zhang, Yi Liu, Hailong Yang, Depei Qian

High-performance computing (HPC) is moving toward the exascale era. However, silent data corruption (SDC), which manifests as bit flips, can have disastrous consequences for scientific computation and jeopardizes the reliability of HPC at large scale. The most commonly used methods for addressing SDC are based on modular redundancy, which usually requires synchronization to keep execution progress consistent between replicas, as well as additional message transmission and comparison during program execution. Although such methods can detect SDC with high recall, they can introduce significant performance overhead and even stall execution progress at large scale. To our knowledge, this paper proposes the first SDC detection solution that requires neither synchronization nor additional message transmission between replicas. It combines message logging with an asynchronous message comparison mechanism, which uses specialized service routines (Data-Analytic-Service, DAS) to perform progress comparison without interfering with the execution of the target program. In addition, our solution adopts a distributed parallel architecture to run DAS and uses a reference mechanism based on a single non-deterministic event to guarantee consistent execution across replicas. We implemented a user-level prototype, termed synchronization-free SDC detection (SFSD). Experimental results on the Tianhe-2 supercomputer show that SFSD is effective in detecting SDC, with low performance overhead (within 10%) and an acceptable recall rate. Moreover, SFSD exhibits good scalability when applied to large-scale program executions.
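
The general idea of comparing logged messages off the critical path can be illustrated with a minimal Python sketch. This is not the SFSD implementation: the `MessageLog` class, the `compare_logs` function, and the use of SHA-256 digests are all assumptions made for illustration. Each replica appends a digest of every outgoing message to its own log, and a separate checker compares only the entries that both replicas have already logged, so neither replica waits for the other. The sketch also assumes that both replicas emit messages in the same order, and it does not model the handling of non-deterministic events mentioned in the abstract.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class MessageLog:
    """Per-replica log of outgoing message digests (hypothetical structure)."""
    digests: list = field(default_factory=list)

    def record(self, dest: int, tag: int, payload: bytes) -> None:
        # Hash the destination, tag, and payload so a bit flip in any of them
        # changes the logged digest.
        h = hashlib.sha256()
        h.update(dest.to_bytes(4, "little"))
        h.update(tag.to_bytes(4, "little"))
        h.update(payload)
        self.digests.append(h.digest())


def compare_logs(log_a: MessageLog, log_b: MessageLog, start: int = 0):
    """Compare the entries both replicas have logged so far, from `start`.

    Returns (mismatched_indices, next_start). Entries that only one replica
    has produced are left for a later call, so a faster replica never blocks.
    """
    end = min(len(log_a.digests), len(log_b.digests))
    mismatches = [i for i in range(start, end)
                  if log_a.digests[i] != log_b.digests[i]]
    return mismatches, end


# Two replicas send the same messages; replica b suffers one flipped bit.
a, b = MessageLog(), MessageLog()
payload = bytes([0b00001111])
flipped = bytes([0b00001111 ^ 0b00000100])  # single bit flip
a.record(1, 0, b"halo")
b.record(1, 0, b"halo")
a.record(2, 0, payload)
b.record(2, 0, flipped)
a.record(3, 0, b"tail")  # replica a is one message ahead of replica b

mismatches, next_start = compare_logs(a, b)
print(mismatches, next_start)
```

Passing the returned `next_start` back into a later call lets the checker resume where it stopped once the slower replica has logged more messages.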


Updated: 2021-06-10