Automatic Kernel Generation for Volta Tensor Cores
arXiv - CS - Programming Languages. Pub Date: 2020-06-22, DOI: arxiv-2006.12645
Somashekaracharya G. Bhaskaracharya, Julien Demouth, Vinod Grover

A commonly occurring computation idiom in neural networks is to perform some pointwise operations on the result of a matrix multiplication. Such a sequence of operations is typically represented as a computation graph in deep learning compilers. When compiling to a GPU target, these computations can be individually mapped to manually tuned implementations provided by libraries such as cuBLAS and cuDNN. These libraries also provide off-the-shelf support for targeting tensor cores in NVIDIA GPUs, which can lead to huge performance boosts through their specialized support for mixed-precision matrix math. Alternatively, tensor cores can be programmed directly using CUDA APIs or inline assembly instructions, which opens up the possibility of generating efficient CUDA kernels automatically for such computations. Automatic kernel generation is particularly crucial when it is beneficial to generate efficient code for an entire computation graph by fusing several operations into a single device function instead of invoking a separate kernel for each of them. Polyhedral compilation techniques provide a systematic approach for the analysis and transformation of a sequence of affine loop-nests. In this paper, we describe a polyhedral approach to generate efficient CUDA kernels for matrix multiplication using inline assembly instructions for programming tensor cores on NVIDIA Volta GPUs. Furthermore, we build on this approach to generate fused kernels for computation sequences involving matrix multiplication and pointwise operations such as bias addition, ReLU activation, etc. Experimental evaluation of these techniques shows that automatically generated kernels can provide significantly better performance than manually tuned library implementations, with speedups ranging up to 2.55X.
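To illustrate the kind of computation the abstract describes, the sketch below shows a mixed-precision matrix multiplication on Volta tensor cores with a fused pointwise ReLU epilogue. It uses the public CUDA WMMA API (nvcuda::wmma) rather than the paper's inline mma.sync assembly or its polyhedral code generator; the kernel name, the one-warp-per-16x16-tile launch scheme, and the fused epilogue are illustrative assumptions, not the authors' generated code.

```cuda
// Minimal sketch (not the paper's generated code): one warp computes one
// 16x16 tile of C = relu(A * B) using Volta tensor cores via the WMMA API.
// Assumes M, N, K are multiples of 16 and row-major fp16 inputs.
// Build with: nvcc -arch=sm_70 ...
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_matmul_relu(const half *A, const half *B, float *C,
                                 int M, int N, int K) {
    // One warp per block; each warp owns one 16x16 output tile.
    int tileM = blockIdx.x * 16;
    int tileN = blockIdx.y * 16;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);

    // Accumulate along K in 16-wide steps; each mma_sync lowers to Volta
    // tensor-core HMMA instructions.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + tileM * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tileN, N);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }

    // Fused pointwise epilogue: ReLU is position-independent, so it can be
    // applied directly to the accumulator fragment before the store.
    // (A per-column bias add would need the tile stored first, or knowledge
    // of the fragment-to-matrix element mapping.)
    for (int i = 0; i < acc_frag.num_elements; i++)
        acc_frag.x[i] = fmaxf(acc_frag.x[i], 0.0f);

    wmma::store_matrix_sync(C + tileM * N + tileN, acc_frag, N,
                            wmma::mem_row_major);
}

// Illustrative launch: the grid covers the output tiles, with 32 threads
// (one warp) per block.
// wmma_matmul_relu<<<dim3(M / 16, N / 16), 32>>>(dA, dB, dC, M, N, K);
```

Fusing the epilogue this way avoids a separate kernel launch and an extra round trip of the output through global memory, which is the motivation the abstract gives for generating fused kernels for a whole computation graph.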

Updated: 2020-08-04