Evaluating the Cost of Atomic Operations on Modern Architectures
arXiv - CS - Performance, Pub Date: 2020-10-19, DOI: arxiv-2010.09852
Hermann Schweizer, Maciej Besta, Torsten Hoefler

Atomic operations (atomics) such as Compare-and-Swap (CAS) or Fetch-and-Add (FAA) are ubiquitous in parallel programming. Yet, performance tradeoffs between these operations and various characteristics of such systems, such as the structure of caches, are unclear and have not been thoroughly analyzed. In this paper we establish an evaluation methodology, develop a performance model, and present a set of detailed benchmarks for latency and bandwidth of different atomics. We consider various state-of-the-art x86 architectures: Intel Haswell, Xeon Phi, Ivy Bridge, and AMD Bulldozer. The results unveil surprising performance relationships between the considered atomics and architectural properties such as the coherence state of the accessed cache lines. One key finding is that all the tested atomics have comparable latency and bandwidth even if they are characterized by different consensus numbers. Another insight is that the hardware implementation of atomics prevents any instruction-level parallelism even if there are no dependencies between the issued operations. Finally, we discuss solutions to the discovered performance issues in the analyzed architectures. Our analysis enables simpler and more effective parallel programming and accelerates data processing on various architectures deployed in both off-the-shelf machines and large compute systems.
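To make the two operations named in the abstract concrete, the following is a minimal C++ sketch that times an uncontended Fetch-and-Add against a CAS-based increment on a single thread. It is not the authors' benchmark or methodology: the names `ns_per_op`, `AtomicTimings`, and `compare_atomics` are hypothetical, the timing uses `std::chrono::steady_clock` rather than hardware counters, and the cache line is assumed to stay in the local core's cache throughout. Results will vary with compiler, flags, and CPU.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: average wall-clock nanoseconds per call of `op`.
template <typename Op>
double ns_per_op(std::uint64_t iters, Op op) {
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iters; ++i) {
        op();
    }
    const auto stop = std::chrono::steady_clock::now();
    const double total_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return total_ns / static_cast<double>(iters);
}

// Hypothetical result type for this sketch.
struct AtomicTimings {
    double faa_ns;            // average ns per fetch_add
    double cas_ns;            // average ns per CAS-based increment
    std::uint64_t faa_count;  // final counter value after the FAA loop
    std::uint64_t cas_count;  // final counter value after the CAS loop
};

// Times `iters` uncontended increments done with FAA and with CAS.
// Assumption: single thread, so each counter's cache line stays local.
AtomicTimings compare_atomics(std::uint64_t iters) {
    // Separate cache lines (64 bytes assumed) so the counters do not interfere.
    alignas(64) std::atomic<std::uint64_t> faa_counter{0};
    alignas(64) std::atomic<std::uint64_t> cas_counter{0};

    const double faa_ns = ns_per_op(iters, [&] {
        faa_counter.fetch_add(1, std::memory_order_seq_cst);
    });

    const double cas_ns = ns_per_op(iters, [&] {
        std::uint64_t expected = cas_counter.load(std::memory_order_relaxed);
        // On failure, compare_exchange_weak reloads `expected`, so we retry.
        while (!cas_counter.compare_exchange_weak(expected, expected + 1,
                                                  std::memory_order_seq_cst)) {
        }
    });

    // Both loops must have performed exactly `iters` increments.
    assert(faa_counter.load() == iters);
    assert(cas_counter.load() == iters);

    return AtomicTimings{faa_ns, cas_ns, faa_counter.load(), cas_counter.load()};
}
```

Calling `compare_atomics(10'000'000)` from a `main` function and printing `faa_ns` and `cas_ns` gives a rough per-operation cost for the uncontended case only. Studying other coherence states, as the paper does, would require additional threads or cores that first read or modify the line; that is outside this sketch.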


Updated: 2020-10-21
