Tensorox: Accelerating GPU Applications via Neural Approximation on Unused Tensor Cores
IEEE Transactions on Parallel and Distributed Systems (IF 5.3), Pub Date: 2021-06-29, DOI: 10.1109/tpds.2021.3093239
Nhut-Minh Ho, Weng-Fai Wong

Driven by the demands of deep learning, many hardware accelerators, including GPUs, now include specialized tensor processing units to accelerate matrix operations. However, general-purpose GPU applications that perform few or no large dense matrix operations cannot benefit from these tensor units. This article proposes Tensorox, a framework that exploits the half-precision tensor cores available on recent GPUs for approximable, non-deep-learning applications. In essence, a shallow neural network is trained on the input-output mapping of the function to be approximated. The key innovation in our implementation is the use of the small, dimension-restricted tensor operations in Nvidia GPUs to run multiple instances of the approximation neural network in parallel. With the proper scaling and training methods, our approximation yields overall accuracy higher than naïvely running the original programs in half precision. Furthermore, Tensorox allows the degree of approximation to be adjusted at runtime. On the 10 benchmarks we tested, we achieved speedups of 2× to 112× over the original single-precision floating-point implementations, while keeping the error introduced by the approximation below 10 percent in most applications.
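The core trick described in the abstract can be sketched in a few lines of CUDA. The kernel below is a minimal illustration under stated assumptions, not the authors' code: it assumes a hypothetical hidden layer of width 16 and packs 16 independent instances of the approximator along the rows of a single 16×16×16 WMMA fragment multiply, so one tensor-core operation evaluates the layer for 16 inputs at once. The kernel and buffer names are invented for this example.

// Minimal sketch (not the Tensorox implementation): batching 16
// independent approximator inferences into one 16x16x16 tensor-core
// multiply via the CUDA WMMA API. Requires sm_70+ (e.g. -arch=sm_70).
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Hypothetical layer kernel: each warp multiplies a 16x16 tile of
// inputs (16 instances x 16 features, row-major, half precision) by
// a shared 16x16 half-precision weight matrix, accumulating in fp32.
__global__ void batched_approx_layer(const half *inputs,
                                     const half *weights,
                                     float *outputs,
                                     int num_tiles) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    if (warp >= num_tiles) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    // Each warp loads its own tile of 16 input vectors; the weight
    // matrix is shared by every instance of the network.
    wmma::load_matrix_sync(a, inputs + warp * 16 * 16, 16);
    wmma::load_matrix_sync(b, weights, 16);
    wmma::mma_sync(acc, a, b, acc);

    // ReLU applied element-wise to the accumulator fragment.
    for (int i = 0; i < acc.num_elements; ++i)
        acc.x[i] = fmaxf(acc.x[i], 0.0f);

    wmma::store_matrix_sync(outputs + warp * 16 * 16, acc, 16,
                            wmma::mem_row_major);
}

Because each approximation network is tiny, throughput comes from batching: many instances share one weight matrix and fill the M dimension of the fragment, which is what lets otherwise idle tensor cores serve a workload that has no large dense matrix operations of its own.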

Updated: 2021-07-27