Enabling Software Resilience in GPGPU Applications via Partial Thread Protection,arXiv - CS - Distributed, Parallel, and Cluster Computing



Enabling Software Resilience in GPGPU Applications via Partial Thread Protection
arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2021-03-04, DOI: arxiv-2103.02825
Lishan Yang, Bin Nie, Adwait Jog, Evgenia Smirni

Graphics Processing Units (GPUs) are widely used across many fields to accelerate computation, but they remain susceptible to transient hardware faults (soft errors) that can easily compromise application output. By taking advantage of the hierarchical organization of general-purpose GPU applications into threads, warps, and cooperative thread arrays, we propose a methodology that identifies the resilience of threads and aims to map threads with the same resilience characteristics to the same warp. This allows partial replication mechanisms for error detection/correction to be engaged at the warp level. By exploring 12 benchmarks (17 kernels) from 4 benchmark suites, we illustrate that threads can be remapped into reliable or unreliable warps with only 1.63% introduced overhead (on average), which then enables selective protection via replication for those groups of threads that truly need it. Furthermore, we show that remapping threads to different warps does not sacrifice application performance. We show how this remapping facilitates warp replication for error detection and/or correction and achieves average reductions of 20.61% and 27.15% in execution cycles compared to standard duplication and triplication, respectively.
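The grouping idea in the abstract can be illustrated with a small host-side sketch. This is not the authors' implementation: the per-thread vulnerability labels, the function names `remap_threads` and `warps_needing_protection`, and the 128-thread block are all hypothetical, and a warp size of 32 is assumed. The sketch only shows why grouping threads by resilience reduces the number of warps that would need replication.

```python
WARP_SIZE = 32  # assumed warp width (threads per warp)


def remap_threads(is_vulnerable):
    """Return an ordering (new slot -> original thread id) that places
    resilient threads first and vulnerable threads last, so threads with
    the same label share warps wherever the counts allow it."""
    resilient = [t for t, v in enumerate(is_vulnerable) if not v]
    vulnerable = [t for t, v in enumerate(is_vulnerable) if v]
    return resilient + vulnerable


def warps_needing_protection(is_vulnerable, order):
    """Count warps that contain at least one vulnerable thread when
    threads are packed into warps in the given order."""
    count = 0
    for start in range(0, len(order), WARP_SIZE):
        warp = order[start:start + WARP_SIZE]
        if any(is_vulnerable[t] for t in warp):
            count += 1
    return count


# Hypothetical labels: every 4th thread of a 128-thread block is vulnerable.
labels = [t % 4 == 0 for t in range(128)]
identity = list(range(128))
remapped = remap_threads(labels)

before = warps_needing_protection(labels, identity)  # every warp is mixed
after = warps_needing_protection(labels, remapped)   # vulnerable threads share one warp
print(before, after)
```

With these made-up labels, all four warps contain a vulnerable thread before remapping, while only one warp does afterwards. When the number of vulnerable threads is not a multiple of the warp size, one mixed warp remains at the boundary between the two groups.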


Updated: 2021-03-05
