SpotSDC: Revealing the Silent Data Corruption Propagation in High-Performance Computing Systems
IEEE Transactions on Visualization and Computer Graphics (IF 5.2), Pub Date: 2020-05-15, DOI: 10.1109/tvcg.2020.2994954
Zhimin Li, Harshitha Menon, Dan Maljovec, Yarden Livnat, Shusen Liu, Kathryn Mohror, Peer-Timo Bremer, Valerio Pascucci

The trend of rapid technology scaling is expected to make the hardware of high-performance computing (HPC) systems more susceptible to computational errors due to random bit flips. Some bit flips may cause a program to crash or have a minimal effect on the output, but others may lead to silent data corruption (SDC), i.e., undetected yet significant output errors. Classical fault injection analysis methods employ uniform sampling of random bit flips during program execution to derive a statistical resiliency profile. However, summarizing such fault injection results in sufficient detail is difficult, and understanding the behavior of the fault-corrupted program is still a challenge. In this article, we introduce SpotSDC, a visualization system to facilitate the analysis of a program’s resilience to SDC. SpotSDC provides multiple perspectives, at various levels of detail, on how the output is affected depending on where in the source code the bit flip occurs, which bit is flipped, and when during the execution it happens. SpotSDC also enables users to study code protection and provides new insights into the behavior of a fault-injected program. Based on lessons learned, we demonstrate how our findings can improve the fault injection campaign method.
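
To make the idea of a classical fault injection campaign concrete, the following is a minimal sketch, not the SpotSDC implementation or the authors' tooling. It assumes a toy workload (a running sum), injects one uniformly sampled bit flip per run into an IEEE 754 double, and tallies the outcome. The names `flip_bit`, `kernel`, `classify`, and `campaign`, the relative-error tolerance, and the choice to label non-finite results as "detected" are all illustrative assumptions.

```python
import math
import random
import struct
from collections import Counter


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of a double's IEEE 754 encoding."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def kernel(data, fault=None):
    """Toy workload (assumption): a running sum.

    `fault` is an optional (step, bit) pair; the accumulator is corrupted
    once, right after the addition at that step.
    """
    total = 0.0
    for step, x in enumerate(data):
        total += x
        if fault is not None and fault[0] == step:
            total = flip_bit(total, fault[1])
    return total


def classify(golden: float, observed: float, tolerance: float = 1e-6) -> str:
    """Label one run; assumes `golden` is nonzero.

    Non-finite output stands in for an error a program could detect;
    a small relative error counts as masked, anything else as SDC.
    """
    if not math.isfinite(observed):
        return "detected"
    relative_error = abs(observed - golden) / abs(golden)
    return "masked" if relative_error <= tolerance else "sdc"


def campaign(data, trials: int, seed: int = 0) -> Counter:
    """Uniformly sample (step, bit) injection sites and count outcomes."""
    rng = random.Random(seed)
    golden = kernel(data)  # fault-free reference output
    outcomes = Counter()
    for _ in range(trials):
        fault = (rng.randrange(len(data)), rng.randrange(64))
        outcomes[classify(golden, kernel(data, fault))] += 1
    return outcomes


if __name__ == "__main__":
    print(campaign([1.0] * 100, trials=1000))
```

The resulting counter is the kind of aggregate resiliency profile the abstract refers to. It discards exactly the information (which step, which bit) that a per-site breakdown would retain.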

Updated: 2020-05-15