Fireiron: A Scheduling Language for High-Performance Linear Algebra on GPUs
arXiv - CS - Programming Languages. Pub Date: 2020-03-13. DOI: arxiv-2003.06324
Authors: Bastian Hagedorn, Archibald Samuel Elliott, Henrik Barthels, Rastislav Bodik, Vinod Grover
Achieving high-performance GPU kernels requires optimizing algorithm
implementations to the targeted GPU architecture. It is of utmost importance to
fully use the compute and memory hierarchy, as well as available specialised
hardware. Currently, vendor libraries like cuBLAS and cuDNN provide the best
performing implementations of GPU algorithms. However, the task of the library
programmer is incredibly challenging: for each provided algorithm,
high-performance implementations have to be developed for all commonly used
architectures, input sizes, and different storage formats. These
implementations are generally provided as optimized assembly code because
performance-critical architectural features are only exposed at this level.
This prevents reuse between different implementations of even the same
algorithm, as simple differences can have major effects on low-level
implementation details. In this paper we introduce Fireiron, a DSL and compiler
which allows the specification of high-performance GPU implementations as
compositions of simple and reusable building blocks. We show how to use
Fireiron to optimize matrix multiplication implementations, achieving
performance matching hand-coded CUDA kernels, even when using specialised
hardware such as NVIDIA Tensor Cores, and outperforming state-of-the-art
implementations provided by cuBLAS by more than 2x.
Updated: 2020-03-16