A GPU-accelerated fast multipole method based on barycentric Lagrange interpolation and dual tree traversal
Computer Physics Communications (IF 7.2), Pub Date: 2021-05-04, DOI: 10.1016/j.cpc.2021.108017
Leighton Wilson, Nathan Vaughn, Robert Krasny

We present a GPU-accelerated fast multipole method (FMM) called BLDTT, which uses barycentric Lagrange interpolation for the near-field and far-field approximations, and dual tree traversal to construct the interaction lists. The scheme replaces well-separated particle-particle interactions by adaptively chosen particle-cluster, cluster-particle, and cluster-cluster approximations given by barycentric Lagrange interpolation on a Chebyshev grid of proxy particles in each cluster. The BLDTT employs FMM-type upward and downward passes, although here they are adapted to interlevel polynomial interpolation. The BLDTT is kernel-independent, and the approximations have a direct sum form that maps efficiently onto GPUs, where targets provide an outer level of parallelism and sources provide an inner level of parallelism. The code uses OpenACC directives for GPU acceleration and MPI remote memory access for distributed memory parallelization. Computations are presented for different particle distributions, domains, and interaction kernels, and for cases with unequal sets of targets and sources. The BLDTT consistently outperforms our earlier particle-cluster barycentric Lagrange treecode (BLTC). On a single GPU, for problem sizes ranging from N = 1E5 to 1E8, the BLTC scales like O(N log N) while the BLDTT scales like O(N). We also present MPI strong scaling results for the BLDTT and BLTC with N = 64E6 particles running on 1 to 32 GPUs.
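To illustrate the particle-cluster approximation shared by the BLTC and BLDTT, the sketch below shows barycentric Lagrange interpolation on a Chebyshev grid of proxy particles in one dimension. This is a simplified illustration in Python, not the authors' implementation (which is three-dimensional, kernel-independent, and written with OpenACC and MPI); the names cheb_points, bary_weights, lagrange_matrix, the 1D stand-in kernel, and all parameter values are assumptions made for the example.

import numpy as np

def cheb_points(n, a, b):
    # n+1 Chebyshev points of the second kind mapped to [a, b]
    return 0.5 * (a + b) + 0.5 * (b - a) * np.cos(np.pi * np.arange(n + 1) / n)

def bary_weights(n):
    # barycentric weights for Chebyshev points of the second kind
    w = np.ones(n + 1)
    w[1::2] = -1.0
    w[0] *= 0.5
    w[-1] *= 0.5
    return w

def lagrange_matrix(y, s, w):
    # L[j, k] = value of the k-th Lagrange basis function (nodes s, weights w) at y[j];
    # assumes no y[j] coincides exactly with a proxy point
    d = y[:, None] - s[None, :]
    num = w[None, :] / d
    return num / num.sum(axis=1, keepdims=True)

def kernel(x, y):
    # 1/|x - y| kernel as a 1D stand-in for the 3D Coulomb kernel (illustrative only)
    return 1.0 / np.abs(x[:, None] - y[None, :])

rng = np.random.default_rng(0)
a, b, n = 0.0, 1.0, 8                      # source cluster [a, b], interpolation degree n
y = rng.uniform(a, b, 200)                 # source particles
q = rng.standard_normal(200)               # source strengths
x = rng.uniform(5.0, 6.0, 100)             # well-separated target particles

s = cheb_points(n, a, b)                   # Chebyshev grid of proxy particles
w = bary_weights(n)
q_hat = lagrange_matrix(y, s, w).T @ q     # modified strengths at the proxy particles

phi_direct = kernel(x, y) @ q              # direct particle-particle sum
phi_approx = kernel(x, s) @ q_hat          # particle-cluster approximation
print("max relative error:", np.max(np.abs(phi_approx - phi_direct) / np.abs(phi_direct)))

The dual tree traversal that builds the interaction lists can likewise be sketched as a recursion over pairs of target and source clusters: a pair satisfying a multipole acceptance criterion is approximated, otherwise the larger cluster is split. The simplified version below distinguishes only cluster-cluster and particle-particle interactions; in the BLDTT the traversal also selects particle-cluster and cluster-particle forms adaptively. The Cluster class, the leaf size, and the opening parameter theta are assumptions for the example.

import numpy as np

class Cluster:
    # minimal 1D cluster: particle indices, bounding interval, children by bisection
    def __init__(self, pts, idx, leaf_size=32):
        self.idx = idx
        lo, hi = pts[idx].min(), pts[idx].max()
        self.center = 0.5 * (lo + hi)
        self.radius = 0.5 * (hi - lo)
        self.children = []
        if len(idx) > leaf_size:
            left = idx[pts[idx] <= self.center]
            right = idx[pts[idx] > self.center]
            self.children = [Cluster(pts, c, leaf_size) for c in (left, right) if len(c)]

def dual_tree_traverse(T, S, theta, lists):
    # T: target cluster, S: source cluster
    dist = abs(T.center - S.center)
    if T.radius + S.radius < theta * dist:
        lists.append(("cluster-cluster", T, S))      # well separated: approximate
    elif not T.children and not S.children:
        lists.append(("particle-particle", T, S))    # two leaves: direct sum
    elif not S.children or (T.children and T.radius >= S.radius):
        for C in T.children:
            dual_tree_traverse(C, S, theta, lists)   # split the larger cluster
    else:
        for C in S.children:
            dual_tree_traverse(T, C, theta, lists)

rng = np.random.default_rng(1)
targets = rng.uniform(0.0, 1.0, 4000)
sources = rng.uniform(0.0, 1.0, 4000)
lists = []
dual_tree_traverse(Cluster(targets, np.arange(len(targets))),
                   Cluster(sources, np.arange(len(sources))),
                   theta=0.5, lists=lists)
print(len(lists), "entries in the interaction lists")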



Updated: 2021-05-10