PittPack: An open-source Poisson’s equation solver for extreme-scale computing with accelerators
Computer Physics Communications (IF 7.2), Pub Date: 2020-09-01, DOI: 10.1016/j.cpc.2020.107272
Jaber J. Hasbestan, Cheng-Nian Xiao, Inanc Senocak

We present a parallel implementation of a direct solver for Poisson's equation on extreme-scale supercomputers with accelerators. We introduce a chunked-pencil decomposition as the domain-decomposition strategy to distribute work among processing elements and achieve superior scalability on large numbers of accelerators. First, chunked-pencil decomposition enables overlapping of nodal communication with data transfer between the central processing units (CPUs) and the graphics processing units (GPUs). Second, it improves data locality by keeping neighboring elements in adjacent memory locations. Third, it allows the use of shared memory for certain segments of the algorithm when possible. Finally, it enables contiguous message transfer among the nodes. Two different communication patterns are designed. The first pattern aims to fully overlap the communication with data transfer and is designed to speed up the overall turnaround time, whereas the second pattern concentrates on low memory usage and is more network friendly for computations at extreme scale. To ensure software portability, we interleave OpenACC with MPI in the software. The numerical solution and its formal second order of accuracy are verified using the method of manufactured solutions for various combinations of boundary conditions. Weak scaling analysis is performed with up to 1.1 trillion Cartesian mesh points on 16384 GPUs of a petascale leadership-class supercomputer.
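
The verification step mentioned in the abstract (checking formal second-order accuracy with the method of manufactured solutions) can be illustrated with a minimal sketch. This is not PittPack's code: it is a serial, one-dimensional analogue that assumes homogeneous Dirichlet boundaries, a hypothetical manufactured solution u(x) = sin(pi x), a standard three-point central-difference stencil, and a direct tridiagonal (Thomas) solve. The function names `max_error` and `observed_orders` are made up for illustration.

```python
import math


def max_error(n):
    """Solve u'' = f on (0, 1), u(0) = u(1) = 0, with n interior points.

    Assumed manufactured solution: u(x) = sin(pi x), so f(x) = -pi^2 sin(pi x).
    Returns the max-norm error of the discrete solution.
    """
    h = 1.0 / (n + 1)
    x = [(i + 1) * h for i in range(n)]
    exact = [math.sin(math.pi * xi) for xi in x]
    # Right-hand side of the stencil equation u[i-1] - 2 u[i] + u[i+1] = h^2 f(x[i]).
    rhs = [-(math.pi ** 2) * math.sin(math.pi * xi) * h * h for xi in x]

    # Thomas algorithm (direct solve) for the tridiagonal system (1, -2, 1).
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = 1.0 / -2.0
    d_prime[0] = rhs[0] / -2.0
    for i in range(1, n):
        denom = -2.0 - c_prime[i - 1]
        c_prime[i] = 1.0 / denom
        d_prime[i] = (rhs[i] - d_prime[i - 1]) / denom

    # Back substitution.
    u = [0.0] * n
    u[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        u[i] = d_prime[i] - c_prime[i] * u[i + 1]

    return max(abs(ui - ei) for ui, ei in zip(u, exact))


def observed_orders(ns):
    """Observed convergence order between consecutive grids.

    Assumes each grid in ns halves the spacing of the previous one.
    """
    errors = [max_error(n) for n in ns]
    return [math.log2(errors[k] / errors[k + 1]) for k in range(len(errors) - 1)]


if __name__ == "__main__":
    # n = 31, 63, 127 gives h = 1/32, 1/64, 1/128.
    for p in observed_orders([31, 63, 127]):
        print(f"observed order: {p:.3f}")
```

Under these assumptions the observed order should approach 2 as the grid is refined, which is the same kind of evidence the abstract reports for the three-dimensional solver across different boundary-condition combinations.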

Updated: 2020-09-01