Energy-Efficient GPU L2 Cache Design Using Instruction-Level Data Locality Similarity
ACM Transactions on Design Automation of Electronic Systems (IF 2.2). Pub Date: 2020-08-18, DOI: 10.1145/3408060
Jingweijia Tan, Kaige Yan, Shuaiwen Leon Song, Xin Fu

This article presents a novel energy-efficient cache design for massively parallel, throughput-oriented architectures such as GPUs. Unlike the L1 data caches on modern GPUs, the L2 cache shared by all streaming multiprocessors is not the primary performance bottleneck, but it does consume a large amount of chip energy. We observe that the L2 cache is significantly underutilized, spending 95.6% of its time storing useless data. If such “dead time” on L2 is identified and reduced, L2’s energy efficiency can be drastically improved. Fortunately, we discover that the SIMT programming model of GPUs provides a unique feature among threads: instruction-level data locality similarity, which can be used to accurately predict data re-reference counts at the L2 cache block level. We propose a simple design that leverages this Locality Similarity to build an energy-efficient GPU L2 Cache, named LoSCache. Specifically, LoSCache uses the data locality information from a small group of cooperative thread arrays (CTAs) to dynamically predict the L2-level data re-reference counts of the remaining CTAs. After that, specific L2 cache lines can be powered off if they are predicted to be “dead” after a certain number of accesses. Experimental results on a wide range of applications demonstrate that our proposed design reduces L2 cache energy by an average of 64% with only 0.5% performance loss. In addition, LoSCache is cost-effective, independent of the scheduling policies, and compatible with state-of-the-art L1 cache designs for additional energy savings.
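To make the prediction idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of instruction-level dead-block prediction: a few monitored CTAs record how many times each L2 block fetched by a given load instruction (identified by its PC) is accessed; the maximum re-reference count observed per PC then serves as the predicted count for blocks fetched by the same PC in the remaining CTAs. The class and method names are illustrative assumptions.

```python
# Hypothetical sketch of LoSCache-style prediction: per-PC re-reference
# counts sampled from a few CTAs predict when blocks of other CTAs die.
from collections import defaultdict

class LoSCachePredictor:
    def __init__(self):
        self.sampled = {}                # PC -> predicted re-reference count
        self._counts = defaultdict(int)  # (PC, block) -> accesses during sampling

    def sample_access(self, pc, block):
        """Record an L2 access made by a monitored (sampling) CTA."""
        self._counts[(pc, block)] += 1

    def finish_sampling(self):
        """Keep the max re-reference count seen per PC as its prediction."""
        per_pc = defaultdict(int)
        for (pc, _), n in self._counts.items():
            per_pc[pc] = max(per_pc[pc], n - 1)  # n accesses = n-1 re-references
        self.sampled = dict(per_pc)

    def is_dead(self, pc, refs_so_far):
        """A block fetched by `pc` is predicted dead once its predicted
        re-reference count is exhausted; its line could then be powered off.
        Unknown PCs are conservatively never predicted dead."""
        return refs_so_far - 1 >= self.sampled.get(pc, float("inf"))

# Usage: a monitored CTA touches block A twice and block B once via PC 0x40.
p = LoSCachePredictor()
for blk in ("A", "A", "B"):
    p.sample_access(pc=0x40, block=blk)
p.finish_sampling()
print(p.is_dead(0x40, refs_so_far=1))  # False: one re-reference still expected
print(p.is_dead(0x40, refs_so_far=2))  # True: predicted dead, line can power off
```

A real design would of course track this in hardware with small saturating counters per sampled line rather than Python dictionaries; the sketch only illustrates the sampling-then-predict structure the abstract describes.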
