Accelerated FDPS: Algorithms to use accelerators with FDPS, Publications of the Astronomical Society of Japan


Publications of the Astronomical Society of Japan (IF 2.2), Pub Date: 2020-02-01, DOI: 10.1093/pasj/psz133
Masaki Iwasawa¹,², Daisuke Namekata², Keigo Nitadori², Kentaro Nomura¹,², Long Wang¹,³, Miyuki Tsubouchi², Junichiro Makino¹,²,⁴


In this paper, we describe the algorithms we implemented in FDPS to make efficient use of accelerator hardware such as GPGPUs. We developed FDPS so that researchers can write their own high-performance parallel particle-based simulation programs without spending a large amount of time on parallelization and performance tuning. The basic idea of FDPS is to provide a high-performance implementation of parallel algorithms for particle-based simulations in a "generic" form, so that researchers can define their own particle data structure and interparticle interaction functions and supply them to FDPS. FDPS, compiled with the user-supplied data type and interaction function, provides all the functions necessary for parallelization, and with those functions researchers can write their programs as though they were writing a simple non-parallel program. It has been possible to use accelerators with FDPS by writing an interaction function that uses the accelerator. However, the efficiency was limited by the latency and bandwidth of communication between the CPU and the accelerator, and also by the mismatch between the degree of parallelism available in the interaction function and that of the hardware. We have modified the interface of the user-provided interaction function so that accelerators are used more efficiently. We have also implemented new techniques that reduce the amount of work on the CPU side and the amount of communication between the CPU and the accelerators. We have measured the performance of N-body simulations using FDPS on a system with an NVIDIA Volta GPGPU, and the achieved performance is around 27% of the theoretical peak. We have constructed a detailed performance model and found that the current implementation can achieve good performance on systems with much smaller memory and communication bandwidth.
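The "generic" form described in the abstract can be illustrated with a minimal C++ sketch. It is not the FDPS API: `Particle`, `Gravity`, `calc_force_all`, and the `group_size` parameter are hypothetical names chosen for illustration, and the fixed-size chunks are only a stand-in for whatever grouping a real library would build. The point is the division of labor: the user supplies a particle type and an interaction function, and a templated driver decides how and when the function is called.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical user-defined particle type (not an FDPS class).
struct Particle {
    double mass;
    double pos[3];
    double acc[3];
    void clear() { acc[0] = acc[1] = acc[2] = 0.0; }
};

// Hypothetical user-defined interaction: softened gravity (G = 1) exerted
// on every particle in `targets` by every particle in `sources`.
struct Gravity {
    double eps2;  // softening length squared
    void operator()(std::vector<Particle>& targets,
                    const std::vector<Particle>& sources) const {
        for (Particle& pi : targets) {
            for (const Particle& pj : sources) {
                double dx[3];
                double r2 = eps2;
                for (int k = 0; k < 3; ++k) {
                    dx[k] = pj.pos[k] - pi.pos[k];
                    r2 += dx[k] * dx[k];
                }
                if (r2 == 0.0) continue;  // coincident particles, no softening
                const double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
                for (int k = 0; k < 3; ++k) {
                    pi.acc[k] += pj.mass * dx[k] * inv_r3;
                }
            }
        }
    }
};

// Generic driver: knows nothing about gravity or the particle layout beyond
// clear(). It splits the targets into groups of `group_size` and calls the
// user function once per group against all sources (a direct sum).
template <class P, class Interaction>
void calc_force_all(std::vector<P>& particles, const Interaction& interact,
                    std::size_t group_size) {
    assert(group_size > 0);
    for (P& p : particles) p.clear();
    for (std::size_t begin = 0; begin < particles.size(); begin += group_size) {
        const std::size_t end = std::min(begin + group_size, particles.size());
        const auto first = particles.begin() + static_cast<std::ptrdiff_t>(begin);
        const auto last = particles.begin() + static_cast<std::ptrdiff_t>(end);
        std::vector<P> group(first, last);       // copy out one target group
        interact(group, particles);              // user-supplied kernel
        std::copy(group.begin(), group.end(), first);  // write results back
    }
}
```

Calling `calc_force_all(particles, Gravity{eps2}, group_size)` fills in the accelerations for every particle. In this sketch each call to the user function handles one small group, so an interaction function that offloads to an accelerator would pay the CPU-to-accelerator transfer cost once per group; that per-call overhead is the kind of latency and bandwidth limit the abstract refers to.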


Updated: 2020-02-01
