GPU Tensor Cores for fast Arithmetic Reductions
IEEE Transactions on Parallel and Distributed Systems (IF 5.6). Pub Date: 2021-01-01. DOI: 10.1109/tpds.2020.3011893
Cristobal A. Navarro, Roberto Carrasco, Ricardo J. Barrientos, Javier A. Riquelme, Raimundo Vega

This article proposes a parallel algorithm for computing the arithmetic reduction of $n$ numbers as a set of matrix-multiply-accumulate (MMA) operations that are executed simultaneously by GPU tensor cores. The analysis, assuming tensors of size $m \times m$, shows that the proposed algorithm has a parallel running time of $T(n) = 5\log_{m^2}{n}$ and a speedup of $S = \frac{4}{5}\log_{2}{m^2}$ over a canonical parallel reduction. Experimental performance results on a Tesla V100 GPU show that the tensor-core based approach is energy efficient and runs up to $\sim 3.2\times$ and $2\times$ faster than a standard GPU-based reduction and Nvidia's CUB library, respectively, while keeping the numerical error below 1 percent with respect to a double-precision CPU reduction. The chained design of the algorithm allows a flexible configuration of GPU thread-blocks, and the optimal values found through experimentation agree with the theoretical ones. The results obtained in this work show that GPU tensor cores are relevant not only for Deep Learning or Linear Algebra computations, but also for applications that require the acceleration of large summations.
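To make the core idea concrete, here is a minimal Python sketch of how a sum can be phrased entirely as MMA operations ($D = A \cdot B + C$ on $m \times m$ tiles), the primitive that tensor cores execute in hardware. This is not the authors' CUDA implementation (which uses half-precision WMMA fragments on the GPU); all function names here are hypothetical illustrations of the technique.

```python
# Sketch (not the paper's CUDA code): reducing n numbers using only
# matrix-multiply-accumulate steps, D = A @ B + C, on m x m tiles.

def mma(a, b, c):
    """m x m matrix-multiply-accumulate: returns a @ b + c."""
    m = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(m)) + c[i][j]
             for j in range(m)] for i in range(m)]

def reduce_with_mma(data, m):
    """Sum `data` using only MMA operations on m x m tiles.

    Each step computes C = E @ A_i + C, where E is the all-ones matrix:
    every row of E @ A_i holds the column sums of tile A_i, so C
    accumulates column sums across tiles. A final C @ E folds those
    column sums into the total, replicated in every entry of the result.
    """
    ones = [[1] * m for _ in range(m)]
    zeros = [[0] * m for _ in range(m)]
    # Pad with zeros so the data splits evenly into m x m tiles.
    padded = data + [0] * (-len(data) % (m * m))
    c = zeros
    for off in range(0, len(padded), m * m):
        tile = [padded[off + i * m: off + (i + 1) * m] for i in range(m)]
        c = mma(ones, tile, c)       # accumulate column sums of this tile
    total = mma(c, ones, zeros)      # fold accumulated column sums into the total
    return total[0][0]

print(reduce_with_mma(list(range(1, 9)), m=2))  # 36 == sum(1..8)
```

Because each MMA consumes an entire $m \times m$ tile ($m^2$ values) at once, chaining these operations reduces the depth of the summation relative to a pairwise tree reduction, which is the source of the $\log_{m^2}{n}$ running time in the analysis above.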
