A high-throughput hybrid task and data parallel Poisson solver for large-scale simulations of incompressible turbulent flows on distributed GPUs
Journal of Computational Physics (IF 3.8), Pub Date: 2021-04-02, DOI: 10.1016/j.jcp.2021.110329
Hadi Zolfaghari , Dominik Obrist

The solution of the pressure Poisson equation arising in the numerical solution of the incompressible Navier–Stokes equations (INSE) is by far the most expensive part of the computational procedure, and often the major restricting factor for parallel implementations. Improvements to iterative linear solvers, e.g. deploying Krylov-based techniques and multigrid preconditioners, have been applied successfully for solving the INSE on CPU-based parallel computers. These numerical schemes, however, do not necessarily perform well on GPUs, mainly due to differences in the hardware architecture. Our previous work using many P100 GPUs of a flagship supercomputer showed that porting a highly optimized MPI-parallel CPU-based INSE solver to GPUs significantly accelerates the underlying numerical algorithms, while the overall acceleration remains limited (Zolfaghari et al. [3]). The performance loss was mainly due to the Poisson solver, particularly the V-cycle geometric multigrid preconditioner. We also observed that the pure compute time of the GPU kernels remained nearly constant as the grid size was increased. Motivated by these observations, we present herein an algebraically simpler, yet more advanced parallel implementation for the solution of the Poisson problem on large numbers of distributed GPUs. Data parallelism is achieved by using the classical Jacobi method with successive over-relaxation and an optimized iterative driver routine. Task parallelism is enhanced by minimizing GPU-GPU data exchanges as iterations proceed, reducing the communication overhead. The hybrid parallelism yields a nearly 300-fold reduction in time-to-solution, and thus in computational cost (measured in node-hours), for the Poisson problem, compared to our best-case CPU-based parallel implementation, which uses a preconditioned BiCGstab method.
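The abstract does not give the solver's implementation details, but the basic building block it names, a relaxed Jacobi sweep for the discrete Poisson equation, can be sketched as follows. This is a minimal serial NumPy illustration (function names, the relaxation factor, and the stopping criterion are assumptions, not the paper's actual GPU code):

```python
import numpy as np

def relaxed_jacobi_poisson(f, h, omega=0.8, tol=1e-8, max_iter=20000):
    """Solve the 2D Poisson equation  laplacian(u) = f  on a square grid
    with homogeneous Dirichlet boundaries, using a relaxed (weighted)
    Jacobi iteration on the standard 5-point stencil."""
    u = np.zeros_like(f)
    for it in range(max_iter):
        u_new = u.copy()
        # Jacobi update on interior points: u = (sum of neighbors - h^2 f) / 4
        u_new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                                    + u[1:-1, :-2] + u[1:-1, 2:]
                                    - h * h * f[1:-1, 1:-1])
        # Relaxation blends the old and new iterates
        u_next = (1.0 - omega) * u + omega * u_new
        # Stop when successive iterates stagnate
        if np.max(np.abs(u_next - u)) < tol:
            return u_next, it
        u = u_next
    return u, max_iter
```

Because each sweep reads only the previous iterate, every grid point (and hence every GPU thread) can be updated independently, which is what makes the method attractive for massively data-parallel hardware despite its slow algebraic convergence.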
The Poisson solver is then embedded in a flow solver with an explicit third-order Runge-Kutta scheme for time integration, which has previously been ported to GPUs. The flow solver is validated and computationally benchmarked for the transition and decay of the Taylor–Green vortex at Re = 1600 and for the flow around a solid sphere at ReD = 3700. Good strong scaling is demonstrated for both benchmarks. Further, nearly 70% lower electrical energy consumption than the CPU implementation is reported for the Taylor–Green vortex case. We finally deploy the solver for DNS of systolic flow in a bileaflet mechanical heart valve, and present new insight into the complex laminar-turbulent transition process in this prosthesis.




Updated: 2021-04-09