A CUDA fast multipole method with highly efficient M2L far field evaluation, The International Journal of High Performance Computing Applications



The International Journal of High Performance Computing Applications (IF 3.5), Pub Date: 2020-10-12, DOI: 10.1177/1094342020964857
Bartosz Kohnke₁, Carsten Kutzner₁, Andreas Beckmann₂, Gert Lube₃, Ivo Kabadshow₂, Holger Dachsel₂, Helmut Grubmüller₁


Solving an N-body problem, electrostatic or gravitational, is a crucial task and the main computational bottleneck in many scientific applications. Its direct solution is a ubiquitous showcase example for the compute power of graphics processing units (GPUs). However, the naïve pairwise summation has $O(N^2)$ computational complexity. The fast multipole method (FMM) can reduce the computational complexity, and thus the runtime scaling, to $O(N)$ for any specified precision. Here, we present a CUDA-accelerated C++ FMM implementation for multi-particle systems with an $r^{-1}$ potential, as found, e.g., in biomolecular simulations. The algorithm involves several operators that exchange information in an octree data structure. We focus on the Multipole-to-Local (M2L) operator because its runtime limits the overall performance. We propose, implement, and benchmark three different M2L parallelization approaches. Approach (1) utilizes Unified Memory to minimize programming and porting efforts. It achieves decent speedups with little implementation work. Approach (2) employs CUDA Dynamic Parallelism to significantly improve performance for high approximation accuracies. The presorted list-based approach (3) fits periodic boundary conditions particularly well. It exploits FMM operator symmetries to minimize both memory accesses and the number of complex multiplications. The result is a compute-bound implementation, i.e., performance is limited by arithmetic operations rather than by memory accesses. The complete CUDA-parallelized FMM is incorporated into the GROMACS molecular dynamics package as an alternative Coulomb solver.
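To make the $O(N^2)$ cost of the naïve pairwise summation concrete, the following is a minimal C++ sketch of direct evaluation of an $r^{-1}$ potential. It is not taken from the paper's implementation: the `Particle` struct and the `direct_potential` function are hypothetical names introduced here for illustration, physical prefactors are omitted, and coincident particles are assumed not to occur.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical particle type for illustration; not from the paper's code.
struct Particle {
    double x, y, z;  // position
    double q;        // charge (or mass in the gravitational case)
};

// Direct pairwise evaluation of the r^{-1} potential at every particle.
// The nested loops visit each unordered pair (i, j) exactly once, so the
// work grows as N*(N-1)/2, i.e. O(N^2). Prefactors such as 1/(4*pi*eps0)
// are omitted in this sketch.
std::vector<double> direct_potential(const std::vector<Particle>& p) {
    std::vector<double> phi(p.size(), 0.0);
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            const double dx = p[i].x - p[j].x;
            const double dy = p[i].y - p[j].y;
            const double dz = p[i].z - p[j].z;
            const double r = std::sqrt(dx * dx + dy * dy + dz * dz);
            // Assumption of this sketch: no two particles coincide.
            assert(r > 0.0);
            const double inv_r = 1.0 / r;
            phi[i] += p[j].q * inv_r;  // contribution of particle j at i
            phi[j] += p[i].q * inv_r;  // symmetric contribution of i at j
        }
    }
    return phi;
}
```

Calling `direct_potential` on a vector of `Particle` values returns one potential value per particle. Doubling the number of particles roughly quadruples the number of pair evaluations, which is the scaling the FMM is designed to avoid.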


Updated: 2020-10-12
