Block red–black MILU(0) preconditioner with relaxation on GPU, Parallel Computing


Parallel Computing (IF 1.4), Pub Date: 2021-02-19, DOI: 10.1016/j.parco.2021.102760
Akemi Shioya, Yusaku Yamamoto

To accelerate Krylov subspace-based linear equation solvers on Graphics Processing Units (GPUs), a stable, efficient and highly parallel preconditioner is essential. One of the strong candidates for such a preconditioner is the combination of the block red–black ordering and the relaxed modified incomplete LU factorization without fill-ins (MILU(0)). In this paper, we present techniques for implementing this type of preconditioner for general-purpose computing on GPUs (GPGPU) using OpenACC. Our implementation is designed for 3-dimensional finite-difference computations with a 7-point stencil, and the matrix storage format is optimized to realize coalesced memory access. Also, mixed-precision computation is employed to exploit the high single-precision performance of GPUs without sacrificing the accuracy of the computed solution. Extensive numerical tests were performed, and the optimal values of various tunable parameters, such as the number of blocks in each direction and the number of workers specified in OpenACC clauses, are discussed. Performance comparison on NVIDIA Quadro GP100 and Tesla K40t GPUs shows that our solver is much faster than existing libraries like cuSPARSE, MAGMA, ViennaCL, and Ginkgo, especially when multiple linear equations with coefficient matrices sharing the same nonzero pattern are solved.
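The combination of block red–black ordering and relaxed MILU(0) named in the abstract can be illustrated with a small CPU reference sketch in Python. This is not the authors' OpenACC implementation. The model matrix (diagonal 6, off-diagonals -1, Dirichlet boundary), the colouring of a block by the parity of its block-index sum, and all function names are assumptions made for illustration. The sketch relies on the fact that a 7-point stencil graph has no triangles, so the factors keep the off-diagonals of A and only the pivots d change; the relaxation parameter omega scales how much of the dropped fill-in is added back to the diagonal.

```python
from itertools import product

DIAG, OFF = 6.0, -1.0  # assumed model 7-point stencil (Dirichlet Laplacian)

def block_red_black_order(n, b):
    """Points of an n^3 grid: all red b^3 blocks first, then all black blocks."""
    def key(p):
        blk = tuple(c // b for c in p)   # index of the block that holds p
        return (sum(blk) % 2, blk, p)    # colour, then block, then lexicographic
    return sorted(product(range(n), repeat=3), key=key)

def neighbours(p, n):
    """Yield the (up to six) stencil neighbours of p that lie inside the grid."""
    for axis in range(3):
        for step in (-1, 1):
            q = list(p)
            q[axis] += step
            if 0 <= q[axis] < n:
                yield tuple(q)

def relaxed_milu0_diagonal(n, b, omega):
    """Pivots d of M = (L + D) D^{-1} (D + U), L and U being the strict
    lower and upper parts of A in the block red-black ordering."""
    points = block_red_black_order(n, b)
    pos = {p: i for i, p in enumerate(points)}
    d = {}
    for p in points:
        d_p = DIAG
        for k in neighbours(p, n):
            if pos[k] < pos[p]:          # k is eliminated before p
                # Entries a_kj with j after k (other than p) create fill-in in row p.
                fill = sum(OFF for j in neighbours(k, n)
                           if pos[j] > pos[k] and j != p)
                d_p -= OFF * (OFF + omega * fill) / d[k]
        d[p] = d_p
    return points, pos, d

def row_sum_defect(n, b, omega):
    """Largest |(M e)_i - (A e)_i| over all rows, with e the all-ones vector."""
    points, pos, d = relaxed_milu0_diagonal(n, b, omega)
    # y = D^{-1} (D + U) e
    y = {p: 1.0 + sum(OFF for j in neighbours(p, n) if pos[j] > pos[p]) / d[p]
         for p in points}
    worst = 0.0
    for p in points:
        nb = list(neighbours(p, n))
        a_e = DIAG + OFF * len(nb)
        m_e = d[p] * y[p] + sum(OFF * y[k] for k in nb if pos[k] < pos[p])
        worst = max(worst, abs(m_e - a_e))
    return worst
```

Setting omega = 0 gives plain ILU(0) and omega = 1 gives unrelaxed MILU(0), for which the row sums of M match those of A; intermediate values interpolate between the two. Calling `row_sum_defect(4, 2, 1.0)` and `row_sum_defect(4, 2, 0.0)` on a 4×4×4 grid with 2×2×2 blocks shows the difference.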


Updated: 2021-02-28
