Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications
IEEE Transactions on Parallel and Distributed Systems (IF 5.3). Pub Date: 2015-05-01. DOI: 10.1109/tpds.2014.2316825
Javier Cabezas, Isaac Gelado, John E. Stone, Nacho Navarro, David B. Kirk, Wen-Mei Hwu

Heterogeneous parallel computing applications often process large data sets that require multiple GPUs to jointly meet their needs for physical memory capacity and compute throughput. However, the lack of high-level abstractions in previous heterogeneous parallel programming models forces programmers to resort to multiple code versions, complex data copy steps, and ad hoc synchronization schemes when exchanging data between multiple GPU devices, which results in high software development cost, poor maintainability, and even poor performance. This paper describes the HPE runtime system and the associated architecture support, which enable a simple, efficient programming interface for exchanging data between multiple GPUs through either interconnects or cross-node network interfaces. The runtime and architecture support presented in this paper can also be used to support other types of accelerators. We show that the simplified programming interface reduces programming complexity. The research presented in this paper started in 2009. It has been implemented and tested extensively in several generations of HPE runtime systems, and has been adopted into NVIDIA GPU hardware and drivers for CUDA 4.0 and beyond since 2011. The availability of real hardware that supports key HPE features provides a rare opportunity to study the effectiveness of the hardware support by running important benchmarks on a real runtime and real hardware. Experimental results show that in an exemplar heterogeneous system, peer DMA combined with double-buffering, pinned buffers, and related software techniques can improve inter-accelerator data communication bandwidth by 2×. These techniques also improve execution speed by 1.6× for a 3D finite difference code, 2.5× for a 1D FFT, and 1.6× for merge sort, all measured on real hardware. The proposed architecture support enables the HPE runtime to deploy these optimizations transparently beneath simple, portable user code, allowing system designers to freely employ devices of different capabilities. We further argue that simple interfaces such as HPE are needed for most applications to benefit from advanced hardware features in practice.
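As a rough illustration of the mechanisms the abstract refers to (peer DMA, double-buffering, and the asynchronous peer copies exposed in CUDA 4.0 and later), the following is a minimal sketch of chunked, double-buffered inter-GPU data exchange through the CUDA runtime API. It is not the HPE interface itself; the buffer sizes, device indices, and the consume kernel are illustrative assumptions.

```cuda
// Minimal sketch (not the HPE API): peer-DMA exchange between two GPUs with
// double buffering, using the CUDA 4.0+ runtime API the abstract refers to.
#include <cuda_runtime.h>
#include <cstdio>

#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) {        \
    fprintf(stderr, "CUDA error %s at %s:%d\n", cudaGetErrorString(e_),            \
            __FILE__, __LINE__);                                                   \
    return 1; } } while (0)

// Placeholder kernel standing in for the computation that consumes each chunk.
__global__ void consume(const float *in, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) (void)in[i];
}

int main() {
    const size_t N = 1 << 24;        // total elements exchanged (assumption)
    const size_t CHUNK = 1 << 22;    // chunk size for double buffering (assumption)
    const int src = 0, dst = 1;      // source and destination GPUs

    // Enable direct peer DMA if the interconnect supports it. If it is not
    // available, cudaMemcpyPeerAsync still works but stages the data through
    // (pinned) host memory.
    CHECK(cudaSetDevice(dst));
    int canAccess = 0;
    CHECK(cudaDeviceCanAccessPeer(&canAccess, dst, src));
    if (canAccess)
        CHECK(cudaDeviceEnablePeerAccess(src, 0));

    float *srcBuf = nullptr, *dstBuf[2] = {nullptr, nullptr};
    CHECK(cudaSetDevice(src));
    CHECK(cudaMalloc(&srcBuf, N * sizeof(float)));
    CHECK(cudaMemset(srcBuf, 0, N * sizeof(float)));
    CHECK(cudaSetDevice(dst));
    CHECK(cudaMalloc(&dstBuf[0], CHUNK * sizeof(float)));
    CHECK(cudaMalloc(&dstBuf[1], CHUNK * sizeof(float)));

    cudaStream_t streams[2];
    CHECK(cudaStreamCreate(&streams[0]));
    CHECK(cudaStreamCreate(&streams[1]));

    // Double buffering: each buffer gets its own stream, so the peer copy of
    // chunk i+1 overlaps with the kernel consuming chunk i, while in-stream
    // ordering keeps a buffer from being overwritten before its kernel finishes.
    for (size_t off = 0, buf = 0; off < N; off += CHUNK, buf ^= 1) {
        size_t n = (N - off < CHUNK) ? (N - off) : CHUNK;
        CHECK(cudaMemcpyPeerAsync(dstBuf[buf], dst, srcBuf + off, src,
                                  n * sizeof(float), streams[buf]));
        consume<<<(unsigned)((n + 255) / 256), 256, 0, streams[buf]>>>(dstBuf[buf], n);
    }
    CHECK(cudaDeviceSynchronize());
    printf("exchange of %zu elements complete\n", N);
    return 0;
}
```

Placing each buffer on its own stream lets the peer copy of one chunk overlap with the kernel consuming the previous chunk, which is the kind of copy/compute overlap the abstract attributes to double-buffering; the HPE runtime's contribution is deploying such optimizations transparently beneath portable user code rather than requiring them to be hand-written as above.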

Updated: 2015-05-01