Optimizing non-coalesced memory access for irregular applications with GPU computing
Frontiers of Information Technology & Electronic Engineering (IF 3), Pub Date: 2020-09-17, DOI: 10.1631/fitee.1900262
Ran Zheng, Yuan-dong Liu, Hai Jin

General-purpose graphics processing units (GPGPUs) can considerably improve computing performance for regular applications. However, many applications exhibit irregular memory access, and graphics processing units (GPUs) deliver smaller benefits for them. In recent years, several studies have proposed solutions that remove static irregular memory access, but eliminating dynamic irregular memory access in software remains a serious challenge. This paper proposes a pure software solution, requiring no hardware extensions or offline profiling, that eliminates dynamic irregular memory access, especially indirect memory access. Data reordering and index redirection reduce the number of memory transactions, thereby improving the performance of GPU kernels. To make data reordering efficient, the reordering operation is offloaded to the GPU, which reduces its overhead and the associated data transfer. Executing the data reordering and data processing kernels concurrently in separate compute unified device architecture (CUDA) streams further reduces the reordering overhead. With these optimizations, the volume of memory transactions is reduced by 16.7%–50% compared with cuSPARSE-based benchmarks, and the performance of irregular kernels improves by 9.64%–34.9% on an NVIDIA Tesla P4 GPU.
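
To make the idea of data reordering with index redirection concrete, the following is a minimal host-side sketch in Python, not the authors' GPU implementation. It models an indirect gather `data[indices[i]]`, counts how many memory segments each warp touches, then reorders the data into first-use order and rewrites the indices to match. The function names, the 32-thread warp, and the segment size of eight elements are illustrative assumptions, not values taken from the paper.

```python
WARP_SIZE = 32      # assumption: threads that issue a memory request together
SEGMENT_ELEMS = 8   # assumption: elements served by one memory transaction


def count_transactions(indices, warp_size=WARP_SIZE, seg=SEGMENT_ELEMS):
    """Count the distinct segments touched by each warp, summed over warps."""
    total = 0
    for start in range(0, len(indices), warp_size):
        warp = indices[start:start + warp_size]
        total += len({i // seg for i in warp})
    return total


def reorder_and_redirect(data, indices):
    """Move data into first-use order and redirect indices to the new slots."""
    new_pos = {}
    reordered = []
    for i in indices:
        if i not in new_pos:
            new_pos[i] = len(reordered)
            reordered.append(data[i])
    # Keep elements that are never referenced so no data is lost.
    for i, value in enumerate(data):
        if i not in new_pos:
            new_pos[i] = len(reordered)
            reordered.append(value)
    redirected = [new_pos[i] for i in indices]
    return reordered, redirected


if __name__ == "__main__":
    data = [float(i) for i in range(1024)]
    indices = [(i * 37) % 1024 for i in range(256)]  # scattered access pattern
    reordered, redirected = reorder_and_redirect(data, indices)
    print("before:", count_transactions(indices))
    print("after: ", count_transactions(redirected))
```

The gather through the redirected indices returns the same values as the original gather, while neighbouring threads now read neighbouring elements. A real implementation would also have to pay for the reordering itself, which is the cost the abstract addresses by running it on the GPU and overlapping it with the processing kernel.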




Updated: 2020-09-17