Multi‐GPU performance optimization of a computational fluid dynamics code using OpenACC
Concurrency and Computation: Practice and Experience (IF 1.5). Pub Date: 2020-09-28, DOI: 10.1002/cpe.6036
Weicheng Xue, Christopher J. Roy
This article investigates the multi-GPU performance of a 3D buoyancy-driven cavity solver using MPI and OpenACC directives on multiple platforms. It shows that the dimension along which the total problem is decomposed significantly affects strong scaling performance on GPUs. Without proper performance optimizations, 1D domain decomposition scales poorly on multiple GPUs due to noncontiguous memory access. All decompositions, however, benefit from the series of performance optimizations presented in this article. Since the buoyancy-driven cavity code is communication-bound on the clusters examined, a series of optimizations, both platform-agnostic and platform-specific, is designed to reduce communication cost and improve memory throughput between hosts and devices. First, a parallel message packing/unpacking strategy developed for noncontiguous data movement between hosts and devices improves overall performance by roughly a factor of 2. Second, transferring only the data each variable's stencil size requires further reduces communication overhead. These two optimizations are general enough to benefit any stencil computation with ghost exchanges. Third, GPUDirect is used to improve communication on clusters whose hardware and software support direct GPU-to-GPU communication without staging through CPU memory. Finally, overlapping communication with computation is shown to be inefficient on multiple GPUs when using only MPI or MPI+OpenACC: although we believe our implementation exposes sufficient communication/computation overlap, actual runs do not exploit the overlap well because of a lack of asynchronous progression.
