Rapid Exploration of Optimization Strategies on Advanced Architectures using TestSNAP and LAMMPS
arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2020-11-25, DOI: arxiv-2011.12875
Rahulkumar Gayatri, Stan Moore, Evan Weinberg, Nicholas Lubbers, Sarah Anderson, Jack Deslippe, Danny Perez, Aidan P. Thompson

The exascale race is at an end with the announcement of the Aurora and Frontier machines. This next generation of supercomputers utilizes diverse hardware architectures to achieve its compute performance, placing an added onus on the performance portability of applications. An expanding fragmentation of programming models would pose a compounding optimization challenge were it not for the evolution of performance-portable frameworks, which provide unified models for mapping abstract hierarchies of parallelism to diverse architectures. Kokkos is one such performance-portable programming model for C++ applications, providing back-end implementations for each major HPC platform. Even with a performance-portable framework, restructuring algorithms to expose higher degrees of parallelism is non-trivial. The Spectral Neighbor Analysis Potential (SNAP) is a machine-learned inter-atomic potential utilized in cutting-edge molecular dynamics simulations. Previous implementations of the SNAP calculation showed a downward trend in their performance relative to peak on newer-generation CPUs and low performance on GPUs. In this paper, we describe the restructuring and optimization of SNAP as implemented in the Kokkos CUDA backend of the LAMMPS molecular dynamics package, benchmarked on NVIDIA GPUs. We identify novel patterns of hierarchical parallelism, facilitating a minimization of memory access overheads and pushing the implementation into a compute-saturated regime. Our implementation via Kokkos enables recompile-and-run efficiency on upcoming architectures. We find an approximately 22x time-to-solution improvement relative to an existing implementation, as measured on an NVIDIA Tesla V100-16GB for an important benchmark.

Updated: 2020-11-27