Incorporating selective victim cache into GPGPU for high-performance computing
Concurrency and Computation: Practice and Experience (IF 1.5), Pub Date: 2017-03-03, DOI: 10.1002/cpe.4104
Jianfei Wang, Fengfeng Fan, Li Jiang, Xiaoyao Liang, Naifeng Jing

Contemporary general-purpose graphics processing units (GPGPUs) successfully parallelize an application into thousands of concurrent threads with remarkably improved performance. These massive thread counts, however, force the threads to compete for the small first-level data (L1D) cache, leading to severe cache thrashing that can significantly degrade overall performance. In this paper, we propose a selective victim cache design that enables better data locality and higher performance. Instead of a small fully associative structure, we first redesign the victim cache as a set-associative structure equivalent to the original L1D cache, to suit GPGPU applications with massive numbers of concurrent threads. To keep the most frequently used data in the L1D for better operand service, we apply a simple prediction scheme that avoids costly block interchanges and evictions. To further save data-storage area, we propose leveraging unallocated registers and shared-memory entries to hold the victim cache data. The experiments demonstrate that our approach considerably increases the on-chip data cache hit rate and delivers better performance with negligible changes to the baseline GPGPU architecture. On average, our selective victim cache design improves performance by 41.3%, achieving a 54.7% increase in data cache hit rate and a 21.8% reduction in block interchanges and evictions.
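The abstract leaves the microarchitectural details to the paper itself, but the core lookup policy it describes can be illustrated with a short sketch. The C++ fragment below is a minimal, hypothetical model of the selective lookup: the victim cache mirrors the L1D's set-associative organization, and a saturating reuse counter stands in for the prediction scheme, promoting a victim block into the L1D only when it appears hot. All sizing constants (NUM_SETS, NUM_WAYS, HOT_BIAS), the address bit slicing, and the replacement choice are assumptions for illustration, not the paper's actual parameters.

#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical parameters; the paper's actual sizing may differ.
constexpr int NUM_SETS = 32;  // same set count as the baseline L1D
constexpr int NUM_WAYS = 4;   // same associativity as the baseline L1D
constexpr int HOT_BIAS = 2;   // reuse-counter threshold for an interchange

struct Block {
    uint64_t tag   = 0;
    bool     valid = false;
    uint8_t  reuse = 0;       // saturating reuse counter (the "predictor")
};

using CacheSet = std::vector<Block>;

struct SelectiveVictimCache {
    std::vector<CacheSet> l1d, victim;  // equivalent set-associative layouts

    SelectiveVictimCache()
        : l1d(NUM_SETS, CacheSet(NUM_WAYS)),
          victim(NUM_SETS, CacheSet(NUM_WAYS)) {}

    // On an L1D miss, probe the victim cache in the same set.
    // Interchange only when the counter says the block is hot;
    // otherwise serve the operand in place and skip the costly swap.
    bool access(uint64_t addr) {
        uint64_t set = (addr >> 7) % NUM_SETS;  // 128-byte lines assumed
        uint64_t tag = addr >> 12;              // offset (7) + index (5) bits
        for (auto& b : l1d[set])
            if (b.valid && b.tag == tag) {      // L1D hit
                if (b.reuse < 255) ++b.reuse;
                return true;
            }
        for (auto& b : victim[set])
            if (b.valid && b.tag == tag) {      // victim cache hit
                if (b.reuse < 255) ++b.reuse;
                if (b.reuse >= HOT_BIAS)
                    interchange(set, b);        // promote hot block into L1D
                return true;
            }
        return false;                           // miss in both: fetch from L2
    }

    void interchange(uint64_t set, Block& vb) {
        Block& evicted = l1d[set][0];           // victim selection elided (e.g. LRU)
        std::swap(evicted, vb);
    }
};

The selectivity is visible in access(): a cold victim hit serves the data in place, so the interchange cost is paid only for blocks the counter expects to be reused, which is how the design cuts block interchanges and evictions.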

Updated: 2017-03-03