Practical Resilience Analysis of GPGPU Applications in the Presence of Single- and Multi-bit Faults, IEEE Transactions on Computers


IEEE Transactions on Computers (IF 3.7), Pub Date: 2021-01-01, DOI: 10.1109/tc.2020.2980541
Lishan Yang, Bin Nie, Adwait Jog, Evgenia Smirni

Graphics Processing Units (GPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors, often caused by high-energy particle strikes, which can significantly affect application output quality. Understanding the resilience of general-purpose GPU (GPGPU) applications is especially challenging: unlike CPU applications, which are mostly single-threaded, GPGPU applications can contain hundreds to thousands of threads. The resulting fault site space is on the order of billions, even for some simple applications and even when only a single-bit fault is considered.

We present a systematic way to progressively prune the fault site space, with the aim of dramatically reducing the number of fault injections so that assessing GPGPU application error resilience becomes practical. The key insight is that although GPGPU applications spawn many threads, many of those threads execute the same set of instructions. Many fault sites are therefore redundant and can be pruned by careful analysis. We identify important features across a set of 10 applications (16 kernels) from the Rodinia and Polybench suites and conclude that threads can be classified primarily by the number of dynamic instructions they execute. We therefore achieve significant fault site reduction by analyzing only a small subset of threads that is representative of the dynamic instruction behavior (and therefore the error resilience behavior) of the GPGPU applications. Further pruning is achieved by identifying dynamic instruction commonalities (and differences) across code blocks within this representative set of threads, by selecting a subset of loop iterations within the representative threads, and by selecting a subset of destination register bit positions.

Together, these steps reduce the number of fault sites by up to seven orders of magnitude, yet the reduced fault site space still accurately captures the error resilience profile of GPGPU applications. We show the effectiveness of the proposed progressive pruning technique for a single-bit fault model and illustrate its application to more challenging cases with three distinct multi-bit fault models.
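The first pruning step described in the abstract, classifying threads by dynamic instruction count and injecting faults only into representatives, can be illustrated with a small Python sketch. This is not the authors' implementation. The function names, the 32-bit register width, the toy workload, and the simplification that every dynamic instruction writes exactly one destination register are all assumptions made for illustration. The later pruning stages (code blocks, loop iterations, bit positions) are not modeled.

```python
from collections import defaultdict

REG_BITS = 32  # assumption: every destination register is 32 bits wide


def flip_bit(value, bit, width=REG_BITS):
    """Single-bit fault model: flip one bit of a register value."""
    if not 0 <= bit < width:
        raise ValueError("bit position out of range")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def group_threads_by_icount(icounts):
    """Map each dynamic instruction count to the thread ids that have it."""
    groups = defaultdict(list)
    for tid, count in enumerate(icounts):
        groups[count].append(tid)
    return dict(groups)


def pick_representatives(groups):
    """Choose one thread per group (here: the lowest thread id)."""
    return sorted(min(tids) for tids in groups.values())


def count_fault_sites(icounts, thread_ids=None, bits=REG_BITS):
    """Fault sites = threads x dynamic instructions x register bits.

    Assumes one destination register per dynamic instruction.
    """
    if thread_ids is None:
        thread_ids = range(len(icounts))
    return sum(icounts[t] * bits for t in thread_ids)


# Hypothetical kernel: 1024 threads, where every 32nd thread takes a
# short path (40 instructions) and the rest run 100 instructions.
icounts = [40 if tid % 32 == 0 else 100 for tid in range(1024)]

groups = group_threads_by_icount(icounts)
representatives = pick_representatives(groups)
exhaustive = count_fault_sites(icounts)
pruned = count_fault_sites(icounts, representatives)

print("representative threads:", representatives)
print("exhaustive fault sites:", exhaustive)
print("pruned fault sites:", pruned)
```

In this toy setting two representative threads stand in for all 1024, so the number of candidate injections shrinks by the ratio of `exhaustive` to `pruned`. Whether representatives with equal instruction counts really share resilience behavior is the empirical claim the paper evaluates; the sketch simply takes it as given.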


Updated: 2021-01-01
