Multi-GPU Parallelization of the NAS Multi-Zone Parallel Benchmarks
IEEE Transactions on Parallel and Distributed Systems ( IF 5.6 ) Pub Date : 2021-01-01 , DOI: 10.1109/tpds.2020.3015148
Marc Gonzalez , Enric Morancho

GPU-based computing systems have become a widely accepted solution in the high-performance-computing (HPC) domain. GPUs have shown highly competitive performance-per-watt ratios and can exploit an astonishing level of parallelism. However, exploiting the peak performance of such devices is a challenge, mainly due to the combination of two essential aspects of multi-GPU execution. On one hand, the workload should be distributed evenly among the GPUs. On the other hand, communication between GPU devices is costly and should be minimized. Therefore, a trade-off between work-distribution schemes and communication overheads conditions the overall performance of parallel applications run on multi-GPU systems. In this article we present a multi-GPU implementation of the NAS Multi-Zone Parallel Benchmarks, whose execution alternates communication and computational phases. We propose several work-distribution strategies that aim to distribute the workload evenly among the GPUs. Our evaluations show that performance is highly sensitive to the distribution strategy, as the communication phases of the applications are heavily affected by the work-distribution schemes applied in the computational phases. In particular, we consider Static, Dynamic, and Guided schedulers to find a trade-off between both phases that maximizes overall performance. In addition, we compare those schedulers with an optimal scheduler computed offline using IBM CPLEX. On an evaluation environment composed of 2 x IBM Power9 8335-GTH processors and 4 x NVIDIA V100 (Volta) GPUs, our multi-GPU parallelization outperforms single-GPU execution by 1.48x to 1.86x (2 GPUs) and by 1.75x to 3.54x (4 GPUs). This article analyses these improvements in terms of the relationship between the computational and communication phases of the applications as the number of GPUs increases. We show that Guided schedulers perform at a similar level to the optimal scheduler.
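The abstract's central problem, assigning unevenly sized multi-zone workloads to GPUs so that per-device loads balance, can be illustrated with a simple greedy list-scheduling sketch. This is a hypothetical stand-in, not the authors' actual Static/Dynamic/Guided schedulers or CPLEX formulation: it sorts zones largest-first and assigns each to the currently least-loaded GPU. The function name `distribute_zones` and the example zone sizes are invented for illustration.

```python
# Hypothetical sketch of a static, size-aware work distribution: assign
# zones to GPUs by greedy list scheduling (largest zones first, each
# placed on the least-loaded GPU so far). Not the paper's implementation.
import heapq

def distribute_zones(zone_sizes, num_gpus):
    """Return a per-GPU list of zone indices, balancing total zone size."""
    # Min-heap of (accumulated load, gpu_id); popping yields the
    # least-loaded GPU (ties broken by lower GPU id).
    heap = [(0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_gpus)]
    # Processing zones in decreasing size tightens the balance bound.
    for zone in sorted(range(len(zone_sizes)), key=lambda z: -zone_sizes[z]):
        load, gpu = heapq.heappop(heap)
        assignment[gpu].append(zone)
        heapq.heappush(heap, (load + zone_sizes[zone], gpu))
    return assignment

# Example: 16 zones of uneven size spread across 4 GPUs.
sizes = [8, 7, 6, 5, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1]
plan = distribute_zones(sizes, 4)
loads = [sum(sizes[z] for z in zs) for zs in plan]
print(loads)  # per-GPU total work; spread stays within one unit here
```

A Guided or Dynamic scheduler, by contrast, would adjust assignments at run time as phase timings are observed, which is where the communication/computation trade-off the paper studies comes in.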

Updated: 2021-01-01