H2Opus: a distributed-memory multi-GPU software package for non-local operators, Advances in Computational Mathematics



H2Opus: a distributed-memory multi-GPU software package for non-local operators
Advances in Computational Mathematics (IF 1.7), Pub Date: 2022-05-10, DOI: 10.1007/s10444-022-09942-6
Stefano Zampini, Wajih Boukaram, George Turkiyyah, Omar Knio, David Keyes


Hierarchical \({\mathscr{H}}^{2}\)-matrices are asymptotically optimal representations for the discretizations of non-local operators such as those arising in integral equations or from kernel functions. Their \(O(N)\) complexity in both memory and operator application makes them particularly suited for large-scale problems. As a result, there is a need for software that supports distributed operations on these matrices so that large-scale problems can be represented. In this paper, we present high-performance, distributed-memory GPU-accelerated algorithms and implementations for matrix-vector multiplication and matrix recompression of hierarchical matrices in the \({\mathscr{H}}^{2}\) format. The algorithms are a new module of H2Opus, a performance-oriented package that supports a broad variety of \({\mathscr{H}}^{2}\) matrix operations on CPUs and GPUs. Performance in the distributed GPU setting is achieved by marshaling the tree data of the hierarchical matrix representation to allow batched kernels to be executed on the individual GPUs. MPI is used for inter-process communication. We optimize the communication data volume and hide much of the communication cost with local compute phases of the algorithms. Results show near-ideal scalability up to 1024 NVIDIA V100 GPUs on Summit, with performance exceeding 2.3 Tflop/s/GPU for the matrix-vector multiplication, and 670 Gflop/s/GPU for matrix compression, which involves batched QR and SVD operations. We illustrate the flexibility and efficiency of the library by solving a 2D variable-diffusivity integral fractional diffusion problem with an algebraic multigrid-preconditioned Krylov solver and demonstrate scalability to problems with up to 16M degrees of freedom on 64 GPUs.
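
The abstract's idea of marshaling tree data so that batched kernels can run on each GPU can be illustrated with a minimal, CPU-only sketch. It is not the H2Opus API: the function names `marshal_by_level` and `batched_upsweep`, the flat `parent` array layout, and the use of a plain sum as the per-node operation are all assumptions made for illustration. The sketch groups the nodes of a tree by level and then processes each level as one batch, from the deepest level upward, which is the general shape of an upsweep pass over a hierarchical basis tree.

```python
from collections import defaultdict


def marshal_by_level(parent):
    """Group node indices by depth so each level can be handled as one batch.

    parent[i] is the parent index of node i, or -1 for the root.
    Assumption: every parent has a smaller index than its children.
    """
    depth = [0] * len(parent)
    levels = defaultdict(list)
    for node, p in enumerate(parent):
        depth[node] = 0 if p < 0 else depth[p] + 1
        levels[depth[node]].append(node)
    return [levels[d] for d in range(len(levels))]


def batched_upsweep(parent, node_values):
    """Accumulate node values toward the root, one batch per tree level.

    The inner loop stands in for a single batched kernel launch; here the
    per-node operation is just a scalar addition (an illustrative stand-in
    for the small dense products of a real hierarchical-matrix upsweep).
    """
    values = list(node_values)
    for batch in reversed(marshal_by_level(parent)):
        for node in batch:
            p = parent[node]
            if p >= 0:
                values[p] += values[node]
    return values


if __name__ == "__main__":
    # Complete binary tree with 7 nodes: 0 is the root, 3..6 are leaves.
    parent = [-1, 0, 0, 1, 1, 2, 2]
    print(marshal_by_level(parent))
    print(batched_upsweep(parent, [0, 0, 0, 1, 2, 3, 4]))
```

Because all nodes of a level are independent of one another, each batch could be dispatched as a single kernel over contiguous data, which is the property the marshaling step is meant to expose.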


Updated: 2022-05-12
