Optimizing the hypre solver for manycore and GPU architectures, Journal of Computational Science

Optimizing the hypre solver for manycore and GPU architectures
Journal of Computational Science (IF 3.1), Pub Date: 2020-12-24, DOI: 10.1016/j.jocs.2020.101279
Damodar Sahasrabudhe, Rohit Zambre, Aparna Chandramowlishwaran, Martin Berzins

The solution of large-scale combustion problems with codes such as Uintah on modern computer architectures requires the use of multithreading and GPUs to achieve performance. Uintah uses a low-Mach number approximation that requires iteratively solving a large system of linear equations. The Hypre iterative solver has solved such systems in a scalable way for Uintah, but using OpenMP with Hypre leads to a slowdown of at least $2 \times$ due to OpenMP overheads. The proposed solution uses MPI Endpoints within Hypre, where each team of threads acts as a different MPI rank. This approach minimizes OpenMP synchronization overhead, performs as fast as or up to $1.44 \times$ faster than Hypre's MPI-only version, and allows the rest of Uintah to be optimized using OpenMP. Profiling of the GPU version of Hypre shows the bottleneck to be the launch overhead of thousands of micro-kernels. The GPU performance was improved by fusing these micro-kernels and was further optimized by using CUDA-aware MPI, resulting in an overall speedup of 1.16 to $1.44 \times$ compared to the baseline GPU implementation.
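The idea that each team of threads acts as a different MPI rank can be illustrated with a small rank-mapping sketch. This is only a toy model under stated assumptions, not the scheme implemented in Hypre: the function name `endpoint_rank`, its parameters, and the contiguous numbering of teams are all hypothetical choices made for illustration.

```python
def endpoint_rank(mpi_rank, thread_id, threads_per_rank, team_size):
    """Map a thread to a hypothetical endpoint rank.

    Assumption (not taken from the paper): teams are numbered
    contiguously inside each MPI process, and every process runs
    the same number of equally sized teams.
    Returns (endpoint_rank, index_of_thread_within_team).
    """
    if team_size <= 0 or threads_per_rank % team_size != 0:
        raise ValueError("threads_per_rank must be a multiple of team_size")
    if not 0 <= thread_id < threads_per_rank:
        raise ValueError("thread_id out of range")

    teams_per_rank = threads_per_rank // team_size
    team_id = thread_id // team_size          # which team this thread belongs to
    local_id = thread_id % team_size          # position inside that team
    return mpi_rank * teams_per_rank + team_id, local_id


if __name__ == "__main__":
    # Two MPI processes, 8 threads each, teams of 4 threads:
    # the application then sees 4 endpoint ranks instead of 2 MPI ranks.
    for rank in range(2):
        for tid in range(8):
            print(rank, tid, endpoint_rank(rank, tid, 8, 4))
```

In this model, threads inside a team would still cooperate through OpenMP, while communication between teams would be addressed by endpoint rank, as if each team were a separate process.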

The above optimization strategies were published in the International Conference on Computational Science 2020 [1]. This work extends the previously published research by carrying out a second phase of communication-centered optimizations in Hypre to improve its scalability on large-scale supercomputers. This includes an efficient non-blocking inter-thread communication scheme, communication-reducing patch assignment, and expression of logical communication parallelism to a new version of the MPICH library that utilizes the underlying network parallelism [2]. These optimizations avoid communication bottlenecks previously observed during strong scaling and improve performance by up to $2 \times$ on 256 nodes of Intel Knights Landing processors.
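To see why patch assignment can affect communication volume, consider a deliberately simplified one-dimensional toy model. It is an assumption-laden illustration rather than the assignment scheme from the paper: patches form a chain, neighbouring patches exchange data, and an exchange counts as communication only when the two patches have different owners. All function names below are hypothetical.

```python
def block_assign(n_patches, n_ranks):
    """Give each rank a contiguous block of patches."""
    return [i * n_ranks // n_patches for i in range(n_patches)]


def round_robin_assign(n_patches, n_ranks):
    """Deal patches out to ranks one at a time."""
    return [i % n_ranks for i in range(n_patches)]


def cross_rank_pairs(owner):
    """Count neighbouring patch pairs owned by different ranks."""
    return sum(1 for a, b in zip(owner, owner[1:]) if a != b)


if __name__ == "__main__":
    n_patches, n_ranks = 8, 4
    for name, assign in (("block", block_assign),
                         ("round-robin", round_robin_assign)):
        owner = assign(n_patches, n_ranks)
        print(name, owner, cross_rank_pairs(owner))
```

With 8 patches on 4 ranks, the block assignment leaves 3 neighbouring pairs that cross a rank boundary, while the round-robin assignment leaves 7. Real three-dimensional patch layouts are more involved, but the same locality argument applies.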

Updated: 2021-01-10
