Countering load-to-use stalls in the NVIDIA Turing GPU
IEEE Micro ( IF 3.6 ) Pub Date : 2020-11-01 , DOI: 10.1109/mm.2020.3012514
Ram Rangan 1 , Naman Turakhia 1 , Alexandre Joly 1

Among its various improvements over prior NVIDIA GPUs, the NVIDIA Turing GPU boasts four key performance enhancements that effectively counter memory load-to-use stalls. First, reduced latency on L1 hits for global memory loads lowers average memory lookup latency. Second, the ability to dynamically configure the L1 data RAM between cacheable memory and scratchpad (shared) memory enables driver software to opportunistically maximize L1 data cache size for programs with low shared memory requirements, increasing L1 hits and reducing load-to-use stalls. Finally, the twin enhancements of doubled vector register file capacity and a dedicated scalar (uniform) register file with its own uniform datapath ease vector register pressure and enable higher warp-level parallelism, leading to better latency tolerance. We find that these enhancements combined deliver an average speedup of 11% on modern gaming applications.
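The dynamically configurable L1/shared-memory split described in the abstract is visible to CUDA programs as a per-kernel carveout preference. The sketch below is illustrative, not taken from the paper: the `saxpy` kernel is a hypothetical example, while `cudaFuncSetAttribute`, `cudaFuncAttributePreferredSharedMemoryCarveout`, and `cudaSharedmemCarveoutMaxL1` are real CUDA runtime API names. It shows how a kernel with no shared-memory use can hint that the unified data RAM should favor the L1 data cache.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel that uses no shared memory; on Turing, the driver
// can then devote most of the unified L1/shared RAM to the L1 data cache.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    // Express a preference (not a guarantee) for the carveout that
    // maximizes L1 capacity over shared memory for this kernel.
    cudaFuncSetAttribute(saxpy,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxL1);
    // ... allocate x/y on the device and launch
    //     saxpy<<<grid, block>>>(n, a, x, y) as usual ...
    return 0;
}
```

The attribute is only a hint: the driver remains free to pick the split it considers best, which is exactly the opportunistic behavior the abstract describes.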

Updated: 2020-11-01