

A Programming Model Performance Study Using the NAS Parallel Benchmarks
Scientific Programming, Pub Date: 2010, DOI: 10.3233/spr-2010-0306
Hongzhang Shan, Filip Blagojević, Seung-Jai Min, Paul Hargrove, Haoqiang Jin, Karl Fuerlinger, Alice Koniges, Nicholas J. Wright

Harnessing the power of multicore platforms is challenging due to the additional levels of parallelism present. In this paper we use the NAS Parallel Benchmarks to study three programming models, MPI, OpenMP and PGAS, to understand their performance and memory usage characteristics on current multicore architectures. To understand these characteristics we use the Integrated Performance Monitoring tool and other methods to measure communication versus computation time, as well as the fraction of the run time spent in OpenMP. The benchmarks are run on two different Cray XT5 systems and an InfiniBand cluster. Our results show that in general the three programming models exhibit very similar performance characteristics. In a few cases, OpenMP is significantly faster because it explicitly avoids communication. For these particular cases, we were able to rewrite the UPC versions and achieve performance equal to that of OpenMP. OpenMP was also the most advantageous in terms of memory usage. We also compare performance between the two Cray systems, which have quad-core and hex-core processors. We show that at scale the performance is almost always slower on the hex-core system because of increased contention for network resources.


Updated: 2020-09-25
