Efficient MPI-based Communication for GPU-Accelerated Dask Applications
arXiv - CS - Performance. Pub Date: 2021-01-21, DOI: arxiv-2101.08878
Aamir Shafi, Jahanzeb Maqbool Hashmi, Hari Subramoni, Dhabaleswar K. Panda

Dask is a popular parallel and distributed computing framework, which rivals Apache Spark in enabling task-based scalable processing of big data. The Dask Distributed library forms the basis of this computing engine and provides support for adding new communication devices. It currently has two communication devices: one for TCP and the other for high-speed networks using UCX-Py, a Cython wrapper to UCX. This paper presents the design and implementation of a new communication backend for Dask, called MPI4Dask, that is targeted at modern HPC clusters built with GPUs. MPI4Dask exploits mpi4py over MVAPICH2-GDR, a GPU-aware implementation of the Message Passing Interface (MPI) standard. MPI4Dask provides point-to-point asynchronous I/O communication coroutines, which are non-blocking concurrent operations defined using the async/await keywords from Python's asyncio framework. Our latency and throughput comparisons suggest that MPI4Dask outperforms UCX by 6x for 1-Byte messages and by 4x for large messages (2 MBytes and beyond). We also conduct a comparative performance evaluation of MPI4Dask against UCX using two benchmark applications: 1) the sum of a cuPy array and its transpose, and 2) a cuDF merge. MPI4Dask speeds up the overall execution time of the two applications by an average of 3.47x and 3.11x, respectively, for 1-6 Dask workers on an in-house cluster built with NVIDIA Tesla V100 GPUs. We also perform a scalability analysis of MPI4Dask against UCX for these applications on TACC's Frontera (GPU) system with up to 32 Dask workers on 32 NVIDIA Quadro RTX 5000 GPUs and 256 CPU cores. MPI4Dask speeds up the execution time of the cuPy and cuDF applications by an average of 1.71x and 2.91x, respectively, for 1-32 Dask workers on the Frontera (GPU) system.
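
The sketch below illustrates the general pattern the abstract describes (it is not MPI4Dask's actual source): non-blocking mpi4py point-to-point calls wrapped in asyncio coroutines, so a pending MPI transfer yields control to the event loop instead of blocking it. The helper names (wait_for, send, recv) are illustrative, and host NumPy buffers stand in for the GPU buffers (e.g. cuPy arrays) that a GPU-aware MPI such as MVAPICH2-GDR can communicate directly. Run with, for example, mpiexec -n 2 python async_mpi_sketch.py.

    import asyncio
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    async def wait_for(req):
        # Poll the non-blocking MPI request, yielding to the asyncio event
        # loop between tests so other coroutines keep running concurrently.
        while not req.Test():
            await asyncio.sleep(0)

    async def send(buf, dest, tag=0):
        # Isend starts the transfer and returns immediately; the coroutine
        # completes only once the request tests as done.
        await wait_for(comm.Isend(buf, dest=dest, tag=tag))

    async def recv(buf, source, tag=0):
        await wait_for(comm.Irecv(buf, source=source, tag=tag))

    async def main():
        # Host buffers for portability; a GPU-aware MPI accepts device
        # buffers in the same calls.
        if rank == 0:
            msg = np.arange(4, dtype='i')
            await send(msg, dest=1)
        elif rank == 1:
            out = np.empty(4, dtype='i')
            await recv(out, source=0)
            print("rank 1 received:", out)

    asyncio.run(main())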

Updated: 2021-01-25