Repurposing GPU Microarchitectures with Light-Weight Out-Of-Order Execution
IEEE Transactions on Parallel and Distributed Systems (IF 5.3), Pub Date: 2021-06-29, DOI: 10.1109/tpds.2021.3093231
Konstantinos Iliakis, Sotirios Xydis, Dimitrios Soudris

The GPU is the dominant platform for accelerating general-purpose workloads due to its computing capacity and cost-efficiency, and GPU applications cover an ever-growing range of domains. To achieve high throughput, GPUs rely on massive multi-threading and fast context switching to overlap computation with memory operations. We observe that among the diverse GPU workloads, there exists a significant class of kernels that fail to maintain a sufficient number of active warps to hide the latency of memory operations, and thus suffer from frequent stalling. We argue that the dominant Thread-Level Parallelism model is not enough to efficiently accommodate the variability of modern GPU applications. To address this inherent inefficiency, we propose a novel micro-architecture with lightweight Out-Of-Order execution capability, enabling Instruction-Level Parallelism to complement the conventional Thread-Level Parallelism model. To minimize the hardware overhead, we carefully design our extension to highly re-use the existing micro-architectural structures and study various design trade-offs to contain the overall area and power overhead while providing improved performance. We show that the proposed architecture outperforms traditional platforms by 23 percent on average for low-occupancy kernels, with an area and power overhead of 1.29 and 10.05 percent, respectively. Finally, we establish the potential of our proposal as a micro-architecture alternative by providing a 16 percent speedup over a wide collection of 60 general-purpose kernels.
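To make the occupancy problem concrete, the sketch below shows a hypothetical memory-latency-bound CUDA kernel of the kind the abstract describes, not code from the paper: each thread walks two pointer chains, every load depends on the previous load of the same chain, and a small block size caps the number of resident warps. The kernel name, launch configuration, and chain-generation formula are illustrative assumptions.

```cuda
// Illustrative sketch only (not code from the paper): a low-occupancy,
// memory-latency-bound kernel of the kind the abstract describes.
#include <cstdio>
#include <cuda_runtime.h>

// Each thread walks two independent pointer chains. Every load depends on the
// previous load of the *same* chain, so the warp stalls for a full memory
// round-trip per step; with few resident warps the scheduler has no other warp
// to switch to, and the TLP-only model breaks down. The two chains, however,
// are mutually independent: once an in-order pipeline stalls on chain A's next
// load, chain B's ready load is blocked behind it as well, which is exactly
// the intra-warp ILP a lightweight out-of-order issue stage could recover.
__global__ void pointer_chase2(const int *next, int *out, int steps)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int a = idx;
    int b = idx ^ 1;
    for (int i = 0; i < steps; ++i) {
        a = next[a];   // depends on the previous value of a
        b = next[b];   // depends on the previous value of b, independent of chain A
    }
    out[idx] = a + b;
}

int main()
{
    const int n = 1 << 20, steps = 1024;
    int *next, *out;
    cudaMallocManaged(&next, n * sizeof(int));
    cudaMallocManaged(&out,  n * sizeof(int));
    for (int i = 0; i < n; ++i)
        next[i] = (int)(((long long)i * 9973 + 1) % n);  // pseudo-random chain to defeat caching

    // Small thread blocks (together with high register/shared-memory use in a
    // real kernel) cap the number of resident warps per SM, i.e. occupancy.
    pointer_chase2<<<n / 64, 64>>>(next, out, steps);
    cudaDeviceSynchronize();
    printf("out[0] = %d\n", out[0]);

    cudaFree(next);
    cudaFree(out);
    return 0;
}
```

Under these assumptions, the kernel spends most of its time stalled on dependent loads even though independent work exists in the same warp, which is the gap the proposed lightweight Out-Of-Order extension targets.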

Updated: 2021-07-27