Mixed-precision iterative refinement using tensor cores on GPUs to accelerate solution of linear systems
Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences (IF 2.9). Pub Date: 2020-11-01. DOI: 10.1098/rspa.2020.0110
Azzam Haidar, Harun Bayraktar, Stanimire Tomov, Jack Dongarra, Nicholas J. Higham

Double-precision floating-point arithmetic (FP64) has been the de facto standard for engineering and scientific simulations for several decades. Problem complexity and the sheer volume of data coming from various instruments and sensors motivate researchers to mix and match various approaches to optimize compute resources, including different levels of floating-point precision. In recent years, machine learning has motivated hardware support for half-precision floating-point arithmetic. A primary challenge in high-performance computing is to leverage reduced-precision and mixed-precision hardware. We show how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability. The techniques we employ include multiprecision LU factorization, the preconditioned generalized minimal residual algorithm (GMRES), and scaling and auto-adaptive rounding to avoid overflow. We also show how to efficiently handle systems with multiple right-hand sides. On the NVIDIA Quadro GV100 (Volta) GPU, we achieve a 4–5× performance increase and 5× better energy efficiency versus the standard FP64 implementation while maintaining an FP64 level of numerical stability.
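The abstract compresses the whole scheme into one paragraph, so a sketch may help: the matrix is factorized once in low precision, and the resulting cheap LU factors are reused inside a refinement loop whose residuals are computed in FP64. The NumPy/SciPy sketch below illustrates that structure only; it is not the paper's CUDA/tensor-core implementation. FP32 stands in for the FP16/FP32 tensor-core factorization (SciPy has no FP16 LU), the correction solve is a plain triangular substitution rather than the paper's LU-preconditioned GMRES, and the function name, tolerances and test matrix are invented for illustration.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-14, max_iter=50):
    """Solve Ax = b with a reduced-precision LU and FP64 refinement."""
    # Factorize once in reduced precision (FP32 here; the paper factorizes
    # in FP16/FP32 on tensor cores, which SciPy cannot emulate).
    lu, piv = lu_factor(A.astype(np.float32))
    # Initial solve in reduced precision, promoted to FP64.
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                      # residual computed in FP64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break                          # FP64-level accuracy reached
        # Correction solve reuses the cheap low-precision factors.
        d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += d
    return x

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))  # FP64-level residual
```

In the GMRES-based variant the paper describes, each correction is instead computed by running GMRES on the residual system preconditioned with the low-precision LU factors, which preserves convergence even when the factorization drops to FP16; the scaling and auto-adaptive rounding mentioned in the abstract keep matrix entries within FP16 range before that factorization.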
