Towards extreme scale dissipative particle dynamics simulations using multiple GPGPUs
Computer Physics Communications (IF 6.3), Pub Date: 2020-06-01, DOI: 10.1016/j.cpc.2020.107159
Jony Castagna, Xiaohu Guo, Michael Seaton, Alan O’Cais

Abstract A multi-GPGPU implementation of mesoscale simulation using the Dissipative Particle Dynamics (DPD) method is presented. This distributed GPU acceleration extends the DL_MESO package to MPI+CUDA in order to exploit the computational power of the latest NVIDIA cards on hybrid CPU–GPU architectures. Details of the widely applicable algorithm implementation and of the data structures designed for memory coalescing are presented. The key algorithmic optimizations for nearest-neighbour list searching of particle pairs for short-range forces, for data exchange, and for overlapping computation with communication are also given. We have carried out strong and weak scaling performance analyses with up to 4096 GPUs. A two-phase mixture separation test case with 1.8 billion particles has been run on the Piz Daint supercomputer at the Swiss National Supercomputer Center. With CUDA-aware MPI, proper GPU affinity, and communication and computation overlap optimizations in the multi-GPU version, the final results demonstrate more than 94% efficiency for weak scaling and more than 80% efficiency for strong scaling. As far as we know, this is the first report in the literature of DPD simulations being run on such a large number of GPUs. The remaining challenges and future work are discussed at the end of the paper.
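
For readers unfamiliar with the two efficiency figures quoted in the abstract, the sketch below uses the conventional textbook definitions of strong and weak scaling efficiency. It is an illustration only: the abstract does not state which reference GPU count or exact formula the authors used, the function names are hypothetical, and the timings are made-up placeholders rather than values from the paper.

```python
def strong_scaling_efficiency(t_ref, n_ref, t_n, n):
    """Strong scaling: the total problem size is fixed.

    Ideally the run time drops by a factor n / n_ref when going from
    n_ref to n GPUs, so efficiency = (t_ref * n_ref) / (t_n * n).
    """
    return (t_ref * n_ref) / (t_n * n)


def weak_scaling_efficiency(t_ref, t_n):
    """Weak scaling: the problem size per GPU is fixed.

    Ideally the run time stays constant as GPUs are added,
    so efficiency = t_ref / t_n.
    """
    return t_ref / t_n


if __name__ == "__main__":
    # Hypothetical timings in seconds per time step (NOT from the paper).
    strong = strong_scaling_efficiency(t_ref=10.0, n_ref=1, t_n=1.5, n=8)
    weak = weak_scaling_efficiency(t_ref=10.0, t_n=10.5)
    print(f"strong scaling efficiency: {strong:.1%}")
    print(f"weak scaling efficiency:   {weak:.1%}")
```

Under these definitions, a value of 1.0 (100%) means perfect scaling; communication cost and load imbalance typically pull the measured value below that.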

Updated: 2020-06-01