Modeling memory bandwidth patterns on NUMA machines with performance counters
arXiv - CS - Distributed, Parallel, and Cluster Computing Pub Date : 2021-06-15 , DOI: arxiv-2106.08026
Daniel Goodman, Roni Haecki, Tim Harris

Computers used for data analytics are often NUMA systems with multiple sockets per machine, multiple cores per socket, and multiple thread contexts per core. Getting peak performance out of these machines requires placing the correct number of threads in the correct positions on the machine. One particularly interesting aspect of the placement of memory and threads is the way it affects the movement of data around the machine, and the increased latency this can introduce to reads and writes. In this paper we describe work on modeling the bandwidth requirements of an application on a NUMA compute node based on the placement of threads. The model is parameterized by sampling performance counters during two application runs with carefully chosen thread placements. Evaluating the model with thousands of measurements shows a median difference from predictions of 2.34% of the bandwidth. The results of this modeling can be used in a number of ways: for performance debugging during development, where the programmer can be alerted to potentially problematic memory access patterns; in systems such as Pandia, which take an application and predict the performance and system load of a proposed thread count and placement; and in libraries of data structures such as Parallel Collections and Smart Arrays, which can abstract memory placement and thread placement issues away from the user when parallelizing code.
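
As an illustration of the kind of accuracy figure quoted above, the following is a minimal sketch of how a median percentage difference between predicted and measured bandwidth could be computed. The abstract does not give the exact definition of the metric, so the formula here (absolute difference as a percentage of the measured bandwidth) is an assumption, and the function name `median_percent_difference` and the sample values are hypothetical, invented purely for illustration.

```python
from statistics import median


def median_percent_difference(predicted, measured):
    """Median of |predicted - measured| expressed as a percentage of measured.

    Assumption: this is one plausible reading of "median difference from
    predictions ... of the bandwidth"; the paper may define it differently.
    """
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    if not predicted:
        raise ValueError("need at least one measurement")
    # Percentage difference for each (prediction, measurement) pair.
    diffs = [100.0 * abs(p - m) / m for p, m in zip(predicted, measured)]
    return median(diffs)


# Hypothetical bandwidth values in GB/s; these are not data from the paper.
predicted = [10.0, 20.0, 30.0]
measured = [10.0, 25.0, 40.0]
result = median_percent_difference(predicted, measured)
print(f"median difference: {result:.2f}% of measured bandwidth")
```

The median is used here rather than the mean because it is less sensitive to a few badly mispredicted placements, which matches the way the abstract reports the result.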

Updated: 2021-06-16