GPU Tensor Cores for fast Arithmetic Reductions
IEEE Transactions on Parallel and Distributed Systems (IF 5.6). Pub Date: 2021-01-01. DOI: 10.1109/tpds.2020.3011893
Cristobal A. Navarro, Roberto Carrasco, Ricardo J. Barrientos, Javier A. Riquelme, Raimundo Vega

This article proposes a parallel algorithm for computing the arithmetic reduction of $n$ numbers as a set of matrix-multiply-accumulate (MMA) operations that are executed simultaneously by GPU tensor cores. The analysis, assuming tensors of size $m \times m$, shows that the proposed algorithm has a parallel running time of $T(n) = 5\log_{m^2}{n}$ and a speedup of $S = \frac{4}{5}\log_{2}{m^2}$ over a canonical parallel reduction. Experimental performance results on a Tesla V100 GPU show that the tensor-core based approach is energy efficient and runs up to $\sim 3.2\times$ and $2\times$ faster than a standard GPU-based reduction and Nvidia's CUB library, respectively, while keeping the numerical error below 1 percent with respect to a double-precision CPU reduction. The chained design of the algorithm allows a flexible configuration of GPU thread-blocks, and the optimal values found through experimentation agree with the theoretical ones. The results obtained in this work show that GPU tensor cores are relevant not only for Deep Learning or Linear Algebra computations, but also for applications that require the acceleration of large summations.
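To make the core idea concrete, here is a minimal Python sketch of how a sum can be phrased entirely as MMA operations ($D = A \cdot B + C$ on $m \times m$ tiles), the primitive that tensor cores execute in hardware. This is not the authors' CUDA implementation (which uses half-precision WMMA fragments on the GPU); all function names here are hypothetical illustrations of the technique.

```python
# Sketch (not the paper's CUDA code): reducing n numbers using only
# matrix-multiply-accumulate steps, D = A @ B + C, on m x m tiles.

def mma(a, b, c):
    """m x m matrix-multiply-accumulate: returns a @ b + c."""
    m = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(m)) + c[i][j]
             for j in range(m)] for i in range(m)]

def reduce_with_mma(data, m):
    """Sum `data` using only MMA operations on m x m tiles.

    Each step computes C = E @ A_i + C, where E is the all-ones matrix:
    every row of E @ A_i holds the column sums of tile A_i, so C
    accumulates column sums across tiles. A final C @ E folds those
    column sums into the total, replicated in every entry of the result.
    """
    ones = [[1] * m for _ in range(m)]
    zeros = [[0] * m for _ in range(m)]
    # Pad with zeros so the data splits evenly into m x m tiles.
    padded = data + [0] * (-len(data) % (m * m))
    c = zeros
    for off in range(0, len(padded), m * m):
        tile = [padded[off + i * m: off + (i + 1) * m] for i in range(m)]
        c = mma(ones, tile, c)       # accumulate column sums of this tile
    total = mma(c, ones, zeros)      # fold accumulated column sums into the total
    return total[0][0]

print(reduce_with_mma(list(range(1, 9)), m=2))  # 36 == sum(1..8)
```

Because each MMA consumes an entire $m \times m$ tile ($m^2$ values) at once, chaining these operations reduces the depth of the summation relative to a pairwise tree reduction, which is the source of the $\log_{m^2}{n}$ running time in the analysis above.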
