GPU acceleration of CaNS for massively-parallel direct numerical simulations of canonical fluid flows
arXiv - CS - Computational Engineering, Finance, and Science. Pub Date: 2020-01-15, DOI: arxiv-2001.05234
Pedro Costa, Everett Phillips, Luca Brandt and Massimiliano Fatica

This work presents the GPU acceleration of the open-source code CaNS for very fast, massively-parallel simulations of canonical fluid flows. The distinct feature of the many-CPU Navier-Stokes solver in CaNS is its fast direct solver for the second-order finite-difference Poisson equation, based on the method of eigenfunction expansions. The solver implements all the boundary conditions valid for this class of problems in a unified framework. Here, we extend the solver to GPU-accelerated clusters using CUDA Fortran. The porting makes extensive use of CUF kernels and has been greatly simplified by the unified memory feature of CUDA Fortran, which handles the data migration between host (CPU) and device (GPU) without defining new arrays in the source code. The overall implementation has been validated against benchmark data for turbulent channel flow, and its performance has been assessed on an NVIDIA DGX-2 system (16 Tesla V100 32GB GPUs, connected with NVLink via NVSwitch). The wall-clock time per time step of the GPU-accelerated implementation is impressively small compared to its CPU implementation on state-of-the-art many-CPU clusters, as long as the domain partitioning is sufficiently small that the data resides mostly on the GPUs. The implementation has been made freely available as open source under the terms of an MIT license.
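Two illustrative sketches of the techniques named in the abstract follow; both are mine, not taken from the paper or from the CaNS source, and the notation and variable names are hypothetical.

First, the eigenfunction-expansion idea behind the fast direct Poisson solver: expanding the second-order finite-difference discretization of the Poisson equation in the eigenfunctions of the one-dimensional difference operators along x and y (Fourier modes in the periodic case; cosine/sine expansions for other boundary conditions) decouples the three-dimensional problem into independent tridiagonal systems along z,

    \left(\frac{\lambda_i}{\Delta x^2} + \frac{\lambda_j}{\Delta y^2}\right)\hat{p}_{ij,k}
      + \frac{\hat{p}_{ij,k+1} - 2\,\hat{p}_{ij,k} + \hat{p}_{ij,k-1}}{\Delta z^2} = \hat{f}_{ij,k},
    \qquad \lambda_i = 2\left(\cos\frac{2\pi i}{N_x} - 1\right) \ \text{(periodic case)},

where the hats denote transformed quantities and the eigenvalues change with the boundary conditions.

Second, a minimal CUDA Fortran fragment showing the two porting ingredients mentioned above, a CUF kernel directive and managed (unified) memory; the loop stands in for a generic update loop, not for any specific CaNS routine:

    program cuf_demo
      use cudafor
      implicit none
      integer, parameter :: n = 1024**2
      real, managed, allocatable :: a(:), b(:)   ! managed = unified memory, visible to both CPU and GPU
      integer :: i, istat

      allocate(a(n), b(n))
      a = 1.0
      b = 2.0

      ! CUF kernel: the compiler maps this loop onto a CUDA grid;
      ! <<<*,*>>> leaves the launch configuration to the compiler/runtime
      !$cuf kernel do <<<*,*>>>
      do i = 1, n
         a(i) = a(i) + b(i)
      end do

      istat = cudaDeviceSynchronize()            ! synchronize before touching managed data on the host
      print *, 'max error: ', maxval(abs(a - 3.0))
      deallocate(a, b)
    end program cuf_demo

With the managed attribute the CUDA runtime migrates the arrays between host and device on demand, which is consistent with the abstract's point that the port did not require declaring separate device copies of existing arrays; the CUF directive lets the compiler generate the kernel itself.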

Updated: 2020-10-05