

Towards Efficient Short-Range Pair Interaction on Sunway Many-Core Architecture
Journal of Computer Science and Technology (IF 1.2), Pub Date: 2021-01-30, DOI: 10.1007/s11390-020-9826-z
Jun-Shi Chen, Hong An, Wen-Ting Han, Zeng Lin, Xin Liu

Short-range pair interactions consume most of the CPU time in molecular dynamics (MD) simulations. Their inherent computational sparsity makes it challenging to build a high-performance kernel on emerging many-core architectures. In this paper, we present a highly efficient short-range force kernel for Sunway, a novel many-core architecture with many unique features. The parallel efficiency of this algorithm on the Sunway many-core processor is strongly limited by poor data locality and write conflicts. To enhance data locality, we adopt a super-cluster-based neighbor list whose granularity is chosen to fit in the local memory of the computing cores. In the absence of a low-overhead locking mechanism, privatizing the force array is a more feasible way to avoid write conflicts, but it incurs a large data reduction overhead. We adopt a dual-slice partitioning scheme for both hardware resources and computing tasks, which uses on-chip data communication to reduce the data reduction overhead and to provide load balancing. Moreover, we exploit single instruction, multiple data (SIMD) parallelism and reorder the instructions of the force kernel on this many-core processor. Experimental results show that the optimized force kernel achieves a 226x speedup over the reference implementation and reaches 20% of the peak FLOP rate on the Sunway many-core processor.
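The write-conflict problem mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' Sunway kernel: the Lennard-Jones potential, the round-robin assignment of particles to workers, and the names `lj_pair_force` and `forces_privatized` are all assumptions made for illustration, and the "workers" run one after another here instead of on separate cores. Each worker accumulates into its own private force array, so applying Newton's third law (updating both particles of a pair) never touches memory owned by another worker; the price is the final reduction loop, which is the overhead the dual-slice scheme is said to reduce.

```python
def lj_pair_force(dx, dy, dz, r2, epsilon=1.0, sigma=1.0):
    """Lennard-Jones force on particle i from particle j.

    (dx, dy, dz) is r_i - r_j and r2 is its squared length.
    The potential is an assumption; the abstract does not name one.
    """
    inv_r2 = sigma * sigma / r2
    inv_r6 = inv_r2 ** 3
    # F_i = 24*eps * (2*(sigma/r)^12 - (sigma/r)^6) / r^2 * (r_i - r_j)
    scale = 24.0 * epsilon * (2.0 * inv_r6 * inv_r6 - inv_r6) / r2
    return scale * dx, scale * dy, scale * dz


def forces_privatized(positions, cutoff, num_workers):
    """Compute short-range forces with one private force array per worker."""
    n = len(positions)
    cutoff2 = cutoff * cutoff

    # Privatization: worker w only ever writes to private[w].
    private = [[[0.0, 0.0, 0.0] for _ in range(n)] for _ in range(num_workers)]

    for w in range(num_workers):  # stands in for parallel cores
        local = private[w]
        for i in range(w, n, num_workers):  # hypothetical round-robin split
            xi, yi, zi = positions[i]
            for j in range(i + 1, n):
                xj, yj, zj = positions[j]
                dx, dy, dz = xi - xj, yi - yj, zi - zj
                r2 = dx * dx + dy * dy + dz * dz
                if r2 >= cutoff2:
                    continue  # outside the short-range cutoff
                fx, fy, fz = lj_pair_force(dx, dy, dz, r2)
                # Newton's third law: equal and opposite update on j.
                # With a shared array this write to j would be a conflict.
                local[i][0] += fx
                local[i][1] += fy
                local[i][2] += fz
                local[j][0] -= fx
                local[j][1] -= fy
                local[j][2] -= fz

    # Reduction: sum the private arrays into the final force array.
    total = [[0.0, 0.0, 0.0] for _ in range(n)]
    for local in private:
        for i in range(n):
            for k in range(3):
                total[i][k] += local[i][k]
    return total
```

The reduction touches `num_workers * n` entries regardless of how many pairs fall inside the cutoff, which is why its cost grows with the core count and becomes worth optimizing on a many-core processor.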


Updated: 2021-02-07
