Performance Optimization of SU3_Bench on Xeon and Programmable Integrated Unified Memory Architecture



arXiv - CS - Hardware Architecture, Pub Date: 2021-02-28, DOI: arxiv-2103.00571
Jesmin Jahan Tithi, Fabio Checconi, Douglas Doerfler, Fabrizio Petrini

SU3_Bench is a microbenchmark developed to explore performance portability across multiple programming models and methodologies using a simple, but nontrivial, mathematical kernel. This kernel is derived from the MILC lattice quantum chromodynamics (LQCD) code. SU3_Bench is bandwidth bound and generates regular compute and data access patterns. Therefore, on most traditional CPU- and GPU-based systems, its performance is determined mainly by the achievable memory bandwidth. Although SU3_Bench is a simple kernel, experience shows that its subtleties require a certain amount of tuning to reach peak performance for a given programming model and hardware, which makes performance portability challenging. In this paper, we share some of the challenges in obtaining peak performance for SU3_Bench on a state-of-the-art Intel Xeon machine. These challenges stem from the nuances of variable definition, the nature of compiler-provided default constructors, how memory is accessed at object creation time, and the NUMA effects on the machine. We discuss how to tackle them to improve SU3_Bench's performance by 2× compared to the original OpenMP implementation available on GitHub. This provides a valuable lesson for other similar kernels. Expanding on the performance portability aspects, we also show early results obtained by porting SU3_Bench to the new Intel Programmable Integrated Unified Memory Architecture (PIUMA), which is characterized by a more balanced flops-to-byte ratio. This paper shows that it is not the usual bandwidth or flops, but rather the pipeline throughput, that determines SU3_Bench's performance on PIUMA. Finally, we show how to improve performance on PIUMA and how that compares with the performance on Xeon, which has around one order of magnitude more flops-per-byte.
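The abstract names default constructors, memory access at object creation, and NUMA effects as the sources of lost performance, but it does not show code. The following is a minimal C++/OpenMP sketch of how those three issues can interact, under stated assumptions. It is not taken from the SU3_Bench sources: the names `Cplx`, `Su3Matrix`, `alloc_untouched`, `first_touch_init`, and `su3_multiply` are hypothetical, and the data layout is simplified to one 3×3 complex matrix per site. It also assumes the operating system uses a first-touch page placement policy, as Linux does by default.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical trivial types: no user-provided constructors, so that
// default-initializing an array does not write to (touch) its memory.
// By contrast, std::complex<float> has a zeroing default constructor.
struct Cplx { float re, im; };
struct Su3Matrix { Cplx e[3][3]; };

// Allocate n matrices without writing to them. `new T[n]` default-initializes,
// which is a no-op for trivial types. std::make_unique<T[]>(n) would instead
// value-initialize (zero) the array on the calling thread.
std::unique_ptr<Su3Matrix[]> alloc_untouched(std::size_t n) {
    return std::unique_ptr<Su3Matrix[]>(new Su3Matrix[n]);
}

// First write happens inside a statically scheduled parallel loop. Under a
// first-touch policy, pages tend to land on the NUMA node of the writing thread.
void first_touch_init(Su3Matrix* m, std::size_t n, float diag) {
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                m[i].e[j][k] = Cplx{ j == k ? diag : 0.0f, 0.0f };
}

// Site-wise 3x3 complex matrix product c[i] = a[i] * b[i]. The loop uses the
// same static schedule as the initialization, so each thread mostly reads and
// writes the pages it touched first.
void su3_multiply(const Su3Matrix* a, const Su3Matrix* b,
                  Su3Matrix* c, std::size_t n) {
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k) {
                Cplx acc{0.0f, 0.0f};
                for (int l = 0; l < 3; ++l) {
                    const Cplx x = a[i].e[j][l];
                    const Cplx y = b[i].e[l][k];
                    acc.re += x.re * y.re - x.im * y.im;
                    acc.im += x.re * y.im + x.im * y.re;
                }
                c[i].e[j][k] = acc;
            }
}
```

A caller would allocate three arrays with `alloc_untouched`, fill the two inputs with `first_touch_init`, and then call `su3_multiply`. The design choice illustrated here is to keep the element type trivial and to perform the first write with the same thread-to-index mapping as the compute loop. Whether this matches the specific fixes applied in the paper is an assumption.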


Updated: 2021-03-02


