Performance Engineering for Real and Complex Tall & Skinny Matrix Multiplication Kernels on GPUs
arXiv - CS - Mathematical Software. Pub Date: 2019-05-08, DOI: arxiv-1905.03136
Dominik Ernst, Georg Hager, Jonas Thies, Gerhard Wellein

General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show poor performance for tall & skinny matrices, which are much taller than they are wide. In this case, NVIDIA's current CUBLAS implementation delivers only a fraction of the potential performance indicated by the roofline model. We describe the challenges and key characteristics of an implementation that can achieve close to optimal performance. We further evaluate different strategies of parallelization and thread distribution and devise a flexible, configurable mapping scheme. To ensure flexibility and allow for highly tailored implementations, we use code generation combined with autotuning. For a large range of matrix sizes in the domain of interest, we achieve at least 2/3 of the roofline performance and often substantially outperform state-of-the-art CUBLAS results on an NVIDIA Volta GPGPU.
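
For intuition on the roofline limit mentioned above, consider the tall & skinny product C = A^T * B, where A (M x K) and B (M x N) have a very long row dimension M and only a few columns K and N. At minimum, roughly 8*M*(K+N) bytes of A and B must be streamed from memory for about 2*M*K*N flops, giving an arithmetic intensity of KN/(4(K+N)) flop/byte; for example, with K = N = 4 and a memory bandwidth of roughly 900 GB/s on a Volta V100, this caps performance near 450 GFlop/s, far below the double-precision peak. The sketch below (not taken from the paper; the shapes, sizes, and setup are illustrative assumptions) shows such a product expressed through cuBLAS, the baseline the authors compare against:

    // Minimal sketch of a tall & skinny DGEMM, C = A^T * B, through cuBLAS.
    // The shapes (M = 1<<24 rows, K = N = 4 columns) are illustrative
    // assumptions, not sizes taken from the paper.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const int M = 1 << 24;      // long ("tall") dimension
        const int K = 4, N = 4;     // short ("skinny") dimensions

        // Device buffers, column-major; left uninitialized in this sketch.
        double *A, *B, *C;
        cudaMalloc((void **)&A, sizeof(double) * (size_t)M * K);
        cudaMalloc((void **)&B, sizeof(double) * (size_t)M * N);
        cudaMalloc((void **)&C, sizeof(double) * K * N);

        cublasHandle_t handle;
        cublasCreate(&handle);

        const double alpha = 1.0, beta = 0.0;
        // C (K x N) = alpha * A^T (K x M) * B (M x N) + beta * C,
        // with leading dimension M for the tall operands.
        cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                    K, N, M, &alpha, A, M, B, M, &beta, C, K);
        cudaDeviceSynchronize();

        printf("C is %d x %d\n", K, N);
        cublasDestroy(handle);
        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }

The paper's contribution is a generated, autotuned kernel family for exactly this shape regime; a cuBLAS call such as the one above serves only as the reference point for the reported speedups.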

Updated: 2020-06-25