Analyzing DUE Errors on GPUs With Neutron Irradiation Test and Fault Injection to Control Flow
IEEE Transactions on Nuclear Science (IF 1.9), Pub Date: 2021-07-21, DOI: 10.1109/tns.2021.3098845
Kojiro Ito, Yangchao Zhang, Hiroaki Itsuji, Takumi Uezono, Tadanobu Toba, Masanori Hashimoto

As GPU applications expand, GPU reliability is drawing more attention, since even reliability-demanding applications are now executed on GPUs. Silent data corruption (SDC) has been widely studied in both irradiation and fault injection experiments. Detectable uncorrected errors (DUEs), on the other hand, are not well studied. This work focuses on DUEs reported by the GPU driver and analyzes those observed in fault injection and neutron irradiation experiments; in the fault injection experiment, faults are injected into the control flow to change the program counter value unexpectedly. DUEs categorized as GPU engine exception, GPU memory page fault, and GPU processing stop are observed in both experiments. In contrast, the DUE categorized by the GPU driver as internal microcontroller halt is observed frequently in the irradiation experiment but is not found in the fault injection experiment, suggesting the need to investigate failures originating from faults in components invisible to programmers.

Updated: 2021-07-21