Fireiron: A Scheduling Language for High-Performance Linear Algebra on GPUs
arXiv - CS - Programming Languages, Pub Date: 2020-03-13, DOI: arxiv-2003.06324
Bastian Hagedorn, Archibald Samuel Elliott, Henrik Barthels, Rastislav Bodik, Vinod Grover

Achieving high-performance GPU kernels requires optimizing algorithm implementations to the targeted GPU architecture. It is of utmost importance to fully use the compute and memory hierarchy, as well as available specialised hardware. Currently, vendor libraries like cuBLAS and cuDNN provide the best performing implementations of GPU algorithms. However, the task of the library programmer is incredibly challenging: for each provided algorithm, high-performance implementations have to be developed for all commonly used architectures, input sizes, and different storage formats. These implementations are generally provided as optimized assembly code because performance-critical architectural features are only exposed at this level. This prevents reuse between different implementations of even the same algorithm, as simple differences can have major effects on low-level implementation details. In this paper we introduce Fireiron, a DSL and compiler which allows the specification of high-performance GPU implementations as compositions of simple and reusable building blocks. We show how to use Fireiron to optimize matrix multiplication implementations, achieving performance matching hand-coded CUDA kernels, even when using specialised hardware such as NVIDIA Tensor Cores, and outperforming state-of-the-art implementations provided by cuBLAS by more than 2x.

Updated: 2020-03-16