ICLA Unit: Intra-Cluster Locality-Aware Unit to Reduce L2 Access and NoC Pressure in GPGPUs
Journal of Circuits, Systems and Computers (IF 0.9), Pub Date: 2021-07-28, DOI: 10.1142/s0218126622500153
Siamak Biglari Ardabili, Gholamreza Zare Fatin

As the number of streaming multiprocessors (SMs) in GPUs grows to deliver higher performance, the reply network carries increasingly heavy traffic, causing congestion in Network-on-Chip (NoC) routers and memory controller (MC) buffers. Because cooperative thread arrays (CTAs) are scheduled locally within clusters, there is a high probability of finding a copy of the requested data in the L1 cache of another SM in the same cluster. To exploit this, SMs must be able to access the local L1 caches of neighboring SMs. The NoC suffers considerable congestion due to its characteristic many-to-few-to-many traffic pattern: many SMs send requests to a few MCs, which then reply to many SMs. By reducing the number of requests sent across the NoC, our proposed Intra-Cluster Locality-Aware (ICLA) unit converts this congested reply traffic into a many-to-many pattern, serving data over the less-utilized core-to-core links and thereby mitigating NoC traffic. The proposed architecture has been evaluated with 15 workloads from the CUDA SDK, Rodinia, and ISPASS2009 benchmark suites, with the ICLA unit modeled and simulated in GPGPU-Sim. The results show an average reduction of 23.79% (up to 49.82%) in network latency, a 15.49% (up to 36.82%) reduction in L2 cache accesses, and an 18.18% (up to 58.1%) average improvement in instructions per cycle (IPC).
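
To make the routing decision concrete, the following C++ sketch models one plausible form of the ICLA lookup: a per-cluster directory mapping cache-line tags to a bitmask of SMs whose L1 may hold the line, consulted on each L1 miss before the request is sent to L2 over the NoC. This is a minimal illustration under assumed parameters (4 SMs per cluster, 128-byte lines); the class and member names are hypothetical and do not come from the paper or from GPGPU-Sim.

    #include <cstdint>
    #include <unordered_map>

    constexpr int kSmsPerCluster = 4;   // assumed cluster size
    constexpr uint64_t kLineBits = 7;   // assumed 128-byte cache lines

    enum class Route { CoreToCore, NocToL2 };

    // Per-cluster bookkeeping of which SMs may hold a given line in L1.
    class IclaUnit {
    public:
        // Called when SM `requester` misses in its own L1. On a directory
        // hit, the line is fetched from `owner` over the core-to-core path;
        // otherwise the request falls back to the usual NoC/L2 path.
        Route routeMiss(int requester, uint64_t addr, int& owner) {
            const uint64_t line = addr >> kLineBits;
            auto it = sharers_.find(line);
            if (it != sharers_.end()) {
                for (int sm = 0; sm < kSmsPerCluster; ++sm) {
                    if (sm != requester && (it->second & (1u << sm))) {
                        owner = sm;
                        it->second |= 1u << requester;  // requester now shares the line
                        return Route::CoreToCore;
                    }
                }
            }
            // No intra-cluster copy: go to L2 over the NoC and record that
            // the requester will hold the line once the reply arrives.
            sharers_[line] |= 1u << requester;
            return Route::NocToL2;
        }

        // Called on an L1 eviction so the sharer bitmap stays accurate.
        void onEvict(int sm, uint64_t addr) {
            auto it = sharers_.find(addr >> kLineBits);
            if (it != sharers_.end() && (it->second &= ~(1u << sm)) == 0)
                sharers_.erase(it);
        }

    private:
        std::unordered_map<uint64_t, uint32_t> sharers_;  // line tag -> SM bitmask
    };

In this sketch, every intra-cluster hit is a request that never enters the reply network, which is the mechanism behind the reported reductions in L2 accesses and NoC latency.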
