Performance Engineering for Real and Complex Tall & Skinny Matrix Multiplication Kernels on GPUs
arXiv - CS - Mathematical Software Pub Date : 2019-05-08 , DOI: arxiv-1905.03136
Dominik Ernst; Georg Hager; Jonas Thies; Gerhard Wellein

General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often perform poorly for tall & skinny matrices, which are much taller than they are wide. In this case, NVIDIA's current CUBLAS implementation delivers only a fraction of the potential performance indicated by the roofline model. We describe the challenges and key characteristics of an implementation that can achieve close to optimal performance. We further evaluate different strategies of parallelization and thread distribution, and devise a flexible, configurable mapping scheme. To ensure flexibility and allow for highly tailored implementations, we use code generation combined with autotuning. For a large range of matrix sizes in the domain of interest, we achieve at least 2/3 of the roofline performance and often substantially outperform state-of-the-art CUBLAS results on an NVIDIA Volta GPGPU.
Updated: 2020-02-19
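
The roofline bound mentioned in the abstract can be illustrated with a minimal sketch. It assumes one common tall & skinny product, C = A^T B with A of size K x M and B of size K x N where K is much larger than M and N, so that memory traffic is dominated by streaming A and B once. The function names, the choice of product, and the hardware numbers are assumptions made for illustration only; they are not taken from the paper and are not measurements.

```python
def arithmetic_intensity(m, n, complex_entries=False):
    """Flop per byte for C = A^T B with A (K x m), B (K x n), K >> m, n.

    Assumption: A and B are each read once from memory and the small
    m x n result is neglected, so K cancels out of the ratio.
    """
    # Per row of A and B: m*n multiply-adds. A real multiply-add counts
    # as 2 flops, a complex one as 8 flops (4 multiplies + 4 additions).
    flops_per_row = (8 if complex_entries else 2) * m * n
    # Double precision: 8 bytes per real entry, 16 per complex entry.
    bytes_per_row = (16 if complex_entries else 8) * (m + n)
    return flops_per_row / bytes_per_row


def roofline_limit(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: the lower of the compute and memory ceilings."""
    return min(peak_flops, intensity * mem_bandwidth)


# Hypothetical round numbers for illustration; replace with the peak
# and the measured memory bandwidth of the actual device.
PEAK_FLOPS = 7.0e12      # flop/s (assumed)
MEM_BANDWIDTH = 8.0e11   # byte/s (assumed)

if __name__ == "__main__":
    for width in (1, 4, 16, 64):
        i_real = arithmetic_intensity(width, width)
        limit = roofline_limit(i_real, PEAK_FLOPS, MEM_BANDWIDTH)
        print(f"M = N = {width:3d}: I = {i_real:6.3f} flop/byte, "
              f"bound = {limit / 1e9:8.1f} Gflop/s")
```

Under these assumptions the intensity for M = N is N/8 flop/byte in the real case and N/4 in the complex case, so narrow matrices are bound by memory bandwidth, and the bound only reaches the compute ceiling once the matrices become wide enough.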

 
