High Performance GPU Code Generation for Matrix-Matrix Multiplication using MLIR: Some Early Results
arXiv - CS - Performance | Pub Date: 2021-08-23 | arXiv:2108.13191
Navdeep Katel, Vivek Khandelwal, Uday Bondhugula

This report presents some early results on code generation targeting tensor cores on NVIDIA GPUs using the MLIR compiler infrastructure. The state of the art in high-performance deep learning today is primarily driven by manually optimized, highly tuned libraries. The approach used to develop such libraries is often not modular or reusable to the same extent as compiler infrastructure like LLVM. Manual optimization typically does not use a standard intermediate representation (IR), although the optimizations performed can be encoded as a sequence of transformation steps and customized passes on an IR. Hand tuning may also miss design points that are only easily reachable by automatic code generation. We believe that until the recent introduction of MLIR (Multi-Level Intermediate Representation), IR infrastructure was not geared to tackle the problem of automatic generation of domain-specific libraries in an effective manner. In particular, it was hard to represent and transform compute abstractions at high, middle, and low levels using a single IR. With suitable abstractions in MLIR, we build an experimental lowering pipeline that is able to automatically generate code for matrix-matrix multiplication on NVIDIA GPUs, targeting their tensor cores. On the set of problem sizes we evaluated, initial performance results show that we attain 95-119% and 80-160% of cuBLAS performance for FP32 and FP16 accumulation, respectively, on NVIDIA's Ampere microarchitecture-based GeForce RTX 3090. We believe that these results can serve as motivation for further research and development on automatic code and library generation using IR infrastructure for similar specialized accelerators.
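To make the target concrete, the sketch below shows the kind of warp-level tensor-core code such a pipeline ultimately has to emit. It is a minimal CUDA example using NVIDIA's standard WMMA API, not the paper's generated output; it assumes row-major FP16 inputs, FP32 accumulation, matrix dimensions that are multiples of 16, and a hypothetical launch configuration that assigns one warp per 16x16 output tile.

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes one 16x16 tile of C = A * B (FP16 inputs, FP32 accumulate).
    // Assumes M, N, K are multiples of 16 and one warp launched per output tile
    // (illustrative mapping only; not the paper's generated code).
    __global__ void wmma_matmul_fp32acc(const half *A, const half *B, float *C,
                                        int M, int N, int K) {
      int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize; // tile row of C
      int warpN = blockIdx.y * blockDim.y + threadIdx.y;              // tile column of C

      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
      wmma::fill_fragment(acc_frag, 0.0f);

      // Walk the K dimension 16 elements at a time, feeding the tensor cores.
      for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K); // 16x16 tile of A
        wmma::load_matrix_sync(b_frag, B + k * N + warpN * 16, N); // 16x16 tile of B
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);        // tensor-core MMA
      }

      // Write the accumulated FP32 tile back to global memory.
      wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, acc_frag, N,
                              wmma::mem_row_major);
    }

The same compute pattern, expressed as IR operations rather than hand-written intrinsics, is what a compiler-driven pipeline has to produce, with choices such as tile sizes, shared-memory placement, and pipelining made automatically rather than by hand.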

Updated: 2021-08-31