Integrating state of the art compute, communication, and autotuning strategies to multiply the performance of ab initio molecular dynamics on massively parallel multi-core supercomputers
Computer Physics Communications (IF 6.3), Pub Date: 2020-11-29, DOI: 10.1016/j.cpc.2020.107745
Tobias Klöffel, Gerald Mathias, Bernd Meyer

In today’s supercomputer hardware, the compute power of the individual nodes grows much faster than the speed of their interconnects. To benefit from this evolution, the challenge in modernizing simulation software is to increase the computational load on the nodes while simultaneously reducing inter-node communication. Here, we demonstrate the implementation of such a strategy for plane-wave based electronic structure methods and ab initio molecular dynamics (AIMD) simulations. Our focus is on ultra-soft pseudopotentials (USPP), since they make it possible to shift workload from fast Fourier transforms (FFTs) to highly node-efficient matrix–matrix multiplications. For communication-intensive routines, such as the multiple distributed 3-d FFTs of the electronic states and the distributed matrix–matrix multiplications related to the β-projectors of the pseudopotentials, the parallel MPI+OpenMP algorithms are revised to overlap computation and communication. The necessary partitioning of the workload is optimized by auto-tuning algorithms. In addition, the largest global MPI_Allreduce operation is replaced by highly tuned, node-local parallelized operations that use MPI shared-memory windows to avoid inter-node communication. A batched algorithm for the multiple 3-d FFTs improves the throughput of the MPI_Alltoall communication and, thus, the scalability of the implementation, both for USPP and for the frequently used norm-conserving pseudopotentials. The new algorithms have been implemented in the AIMD program CPMD (www.cpmd.org). The enhanced performance and scalability of the code are demonstrated on simulations of liquid water with up to 2048 molecules. It is shown that 100 ps simulations with many hundreds of water molecules can now be done routinely within a few days on a moderate number of nodes.
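One of the techniques named in the abstract is replacing a global MPI_Allreduce by node-local parallelized reductions through MPI shared-memory windows. The sketch below illustrates that general technique in a minimal, self-contained form; it is not the CPMD implementation, and the array size, window layout, and synchronization pattern are assumptions chosen for clarity. Each rank on a node exposes a slice of data in one MPI-3 shared window, and the slices are summed in parallel through direct loads and stores, without any inter-node traffic.

/* Minimal sketch (illustrative only, not the CPMD code) of a node-local
 * parallel sum reduction through an MPI-3 shared-memory window.
 * Compile with: mpicc -O2 shm_reduce.c -o shm_reduce */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Communicator of all ranks that can share memory (one per node). */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);
    int nrank, nsize;
    MPI_Comm_rank(node, &nrank);
    MPI_Comm_size(node, &nsize);

    const MPI_Aint N = 1 << 20;           /* doubles per rank (example) */

    /* One shared window; every node rank contributes one slice. */
    double *mine;
    MPI_Win win;
    MPI_Win_allocate_shared(N * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, node, &mine, &win);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    for (MPI_Aint i = 0; i < N; ++i)      /* fill local contribution */
        mine[i] = 1.0;

    MPI_Win_sync(win);                    /* publish local stores ...  */
    MPI_Barrier(node);                    /* ... and wait for all slices */

    /* Parallel reduction: each rank sums its share [lo,hi) of the index
     * range across all slices into rank 0's slice (disjoint writes). */
    MPI_Aint sz; int du;
    double *dst;
    MPI_Win_shared_query(win, 0, &sz, &du, &dst);
    const MPI_Aint lo = nrank * N / nsize;
    const MPI_Aint hi = (nrank + 1) * N / nsize;
    for (int r = 1; r < nsize; ++r) {
        double *src;
        MPI_Win_shared_query(win, r, &sz, &du, &src);
        for (MPI_Aint i = lo; i < hi; ++i)
            dst[i] += src[i];
    }

    MPI_Win_sync(win);                    /* publish the partial sums */
    MPI_Barrier(node);                    /* reduction is now complete */
    MPI_Win_unlock_all(win);

    if (nrank == 0)
        printf("sum[0] = %.1f (expected %d)\n", dst[0], nsize);

    MPI_Win_free(&win);
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}

With the node-local sum done this way, a cross-node combination only needs a collective among one leader rank per node, so the number of participants in the global operation drops by a factor of the ranks per node; how the node-local result is consumed in CPMD itself is beyond what the abstract specifies.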



Updated: 2020-12-14