GPU-aware Communication with UCX in Parallel Programming Models: Charm++, MPI, and Python
arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2021-02-24. DOI: arxiv-2102.12416. Jaemin Choi, Zane Fink, Sam White, Nitin Bhat, David F. Richards, Laxmikant V. Kale
As an increasing number of leadership-class systems embrace GPU accelerators
in the race towards exascale, efficient communication of GPU data is becoming
one of the most critical components of high-performance computing. For
developers of parallel programming models, implementing support for GPU-aware
communication using native APIs for GPUs such as CUDA can be a daunting task as
it requires considerable effort with little guarantee of performance. In this
work, we demonstrate the capability of the Unified Communication X (UCX)
framework to compose a GPU-aware communication layer that serves multiple
parallel programming models developed out of the Charm++ ecosystem, including
MPI and Python: Charm++, Adaptive MPI (AMPI), and Charm4py. We demonstrate the
performance impact of our designs with microbenchmarks adapted from the OSU
benchmark suite, obtaining improvements in latency of up to 10.2x, 11.7x, and
17.4x in Charm++, AMPI, and Charm4py, respectively. We also observe increases
in bandwidth of up to 9.6x in Charm++, 10x in AMPI, and 10.5x in Charm4py. We
show the potential impact of our designs on real-world applications by
evaluating weak and strong scaling performance of a proxy application that
performs the Jacobi iterative method, improving the communication performance
by up to 12.4x in Charm++, 12.8x in AMPI, and 19.7x in Charm4py.
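The proxy application used for the scaling study performs the Jacobi iterative method. As a minimal illustration of that method, the sketch below runs one-process Jacobi sweeps of the standard 5-point Laplace stencil in plain NumPy; it is an assumption-laden toy, not the paper's GPU-aware Charm++/AMPI/Charm4py implementation (grid size, boundary values, and iteration count are arbitrary choices for the demo).

```python
import numpy as np

def jacobi_step(grid):
    # One Jacobi sweep: each interior point becomes the average of its
    # four neighbors; boundary rows/columns are left fixed (Dirichlet).
    new = grid.copy()
    new[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                              grid[1:-1, :-2] + grid[1:-1, 2:])
    return new

# Toy problem: Laplace's equation on a 32x32 grid with the top
# boundary held at 1.0 and all other boundaries at 0.0.
n = 32
grid = np.zeros((n, n))
grid[0, :] = 1.0
for _ in range(500):
    grid = jacobi_step(grid)
```

In the distributed proxy application, each iteration would additionally exchange halo (ghost) rows and columns with neighboring ranks; those halo exchanges of GPU-resident data are exactly the communication that the paper's UCX-based layer accelerates.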
Updated: 2021-02-25