Matrix multiplication on batches of small matrices in half and half-complex precisions
Journal of Parallel and Distributed Computing (IF 3.4), Pub Date: 2020-07-15, DOI: 10.1016/j.jpdc.2020.07.001
Ahmad Abdelfattah, Stanimire Tomov, Jack Dongarra

Machine learning and artificial intelligence (AI) applications often rely on performing many small matrix operations, in particular general matrix–matrix multiplication (GEMM). These operations are usually performed in reduced precision, such as the 16-bit floating-point format (half precision, or FP16). The GEMM operation is also central to dense linear algebra algorithms, and half-precision GEMMs can be used in mixed-precision linear solvers. High-performance batched GEMM operations in reduced precision are therefore of significant importance, not only for deep learning frameworks, but also for scientific applications that rely on batched linear algebra, such as tensor contractions and sparse direct solvers.
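
For concreteness, a batched GEMM applies the same operation, C_i = alpha * A_i * B_i + beta * C_i, to a large set of independent small matrices in a single call. The CUDA sketch below is not the paper's kernel; it only shows how such a batch is typically issued in FP16 through cuBLAS's stock cublasHgemmBatched routine (cuBLAS being the reference the paper compares against), with an illustrative matrix size and batch count and the data initialization omitted.

// A minimal sketch, not the paper's kernels: launching a batch of small FP16
// GEMMs C_i = alpha * A_i * B_i + beta * C_i via cuBLAS's batched interface.
#include <cstdio>
#include <vector>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int n = 32;            // each matrix is n x n, a "small" size in this context
    const int batch = 1000;      // number of independent GEMMs in the batch
    const size_t elems = size_t(n) * n;

    // One contiguous buffer per operand; the batched API takes arrays of
    // per-matrix device pointers, so those are built as well.
    __half *dA, *dB, *dC;
    cudaMalloc(&dA, elems * batch * sizeof(__half));
    cudaMalloc(&dB, elems * batch * sizeof(__half));
    cudaMalloc(&dC, elems * batch * sizeof(__half));
    cudaMemset(dA, 0, elems * batch * sizeof(__half));   // real input data omitted
    cudaMemset(dB, 0, elems * batch * sizeof(__half));
    cudaMemset(dC, 0, elems * batch * sizeof(__half));

    std::vector<__half*> hA(batch), hB(batch), hC(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = dA + size_t(i) * elems;
        hB[i] = dB + size_t(i) * elems;
        hC[i] = dC + size_t(i) * elems;
    }
    __half **dAptr, **dBptr, **dCptr;
    cudaMalloc(&dAptr, batch * sizeof(__half*));
    cudaMalloc(&dBptr, batch * sizeof(__half*));
    cudaMalloc(&dCptr, batch * sizeof(__half*));
    cudaMemcpy(dAptr, hA.data(), batch * sizeof(__half*), cudaMemcpyHostToDevice);
    cudaMemcpy(dBptr, hB.data(), batch * sizeof(__half*), cudaMemcpyHostToDevice);
    cudaMemcpy(dCptr, hC.data(), batch * sizeof(__half*), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const __half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    cublasHgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, (const __half* const*)dAptr, n,
                               (const __half* const*)dBptr, n,
                       &beta,  dCptr, n, batch);
    cudaDeviceSynchronize();
    printf("ran %d FP16 GEMMs of size %dx%d\n", batch, n, n);

    cublasDestroy(handle);
    cudaFree(dAptr); cudaFree(dBptr); cudaFree(dCptr);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

Compiled with nvcc and linked against -lcublas, this runs the whole batch in one library call; the paper's contribution is replacing such vendor routines with kernels that are faster for the small sizes described below.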

This paper presents optimized batched GEMM kernels in FP16 arithmetic for graphics processing units (GPUs), covering both real and complex half-precision computations. The proposed design takes advantage of the Tensor Core technology recently introduced in CUDA-enabled GPUs. The design exposes eight tuning parameters, which give the kernels enough flexibility to overcome the limitations imposed by the hardware and software (in the form of discrete configurations for the Tensor Core APIs). For real FP16 arithmetic, speedups over cuBLAS of 1.5× to 2.5× are observed for sizes up to 128. For the complex FP16 GEMM kernel, the speedups range from 1.7× to 7×, thanks to a design that uses the standard interleaved matrix layout, in contrast with the planar layout required by the vendor's solution. The paper also discusses special optimizations for extremely small matrices, where even higher performance gains are achievable.
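
The discrete configurations mentioned above come from the way Tensor Cores are exposed to CUDA C++ code: the WMMA API operates on matrix fragments of a few fixed shapes, such as 16x16x16 for FP16. The minimal sketch below, which is not the paper's tuned kernel, shows one warp multiplying a single 16x16 FP16 tile through that API; a flexible design such as the one described here also has to decide how many of these fixed-shape tiles each thread block processes and how their data is staged, which the sketch does not attempt to reproduce. It requires a Tensor Core capable GPU (compute capability 7.0 or higher), e.g. nvcc -arch=sm_70.

// A minimal WMMA (Tensor Core) sketch: one warp computes C = A * B for a
// single 16x16 FP16 tile, using one of the fixed fragment shapes (16x16x16).
#include <mma.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
using namespace nvcuda;

__global__ void wmma_tile_gemm(const __half *A, const __half *B, __half *C) {
    // Fragments are warp-wide objects; their shape is fixed by the API.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::col_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, __half> c_frag;

    wmma::fill_fragment(c_frag, __float2half(0.0f));     // C starts at zero
    wmma::load_matrix_sync(a_frag, A, 16);                // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);       // Tensor Core multiply-accumulate
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_col_major);
}

int main() {
    const size_t bytes = 16 * 16 * sizeof(__half);
    __half *A, *B, *C;
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    cudaMemset(A, 0, bytes);   // input data omitted; zeros keep the demo well-defined
    cudaMemset(B, 0, bytes);
    wmma_tile_gemm<<<1, 32>>>(A, B, C);   // one warp per 16x16x16 tile
    cudaDeviceSynchronize();
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}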



Updated: 2020-07-17