Optimised allgatherv, reduce_scatter and allreduce communication in message-passing systems
arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2020-06-23, DOI: arxiv-2006.13112
Andreas Jocksch, Noe Ohana, Emmanuel Lanti, Vasileios Karakasis, Laurent Villard

Collective communications in message-passing systems, namely the patterns allgatherv, reduce_scatter, and allreduce, are optimised based on measurements taken at the installation time of the library. The algorithms used are set up in an initialisation phase of the communication, similar to the method used in so-called persistent collective communication introduced in the literature. For allgatherv and reduce_scatter, the existing algorithms, recursive multiply/divide and cyclic shift (Bruck's algorithm), are applied with a flexible number of communication ports per node. The algorithms for equal message sizes are used with non-equal message sizes together with a heuristic for rank reordering. The two communication patterns are applied in a plasma physics application that uses a specialised matrix-vector multiplication. For the allreduce pattern, the cyclic shift algorithm is applied with a prefix operation. The data is gathered and scattered by the cores within the node, and the communication algorithms are applied across the nodes. In general, our routines outperform the non-persistent counterparts in established MPI libraries by up to one order of magnitude or show equal performance, with a few exceptions for particular numbers of nodes and message sizes.
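
To illustrate the cyclic shift (Bruck) pattern mentioned in the abstract, the following is a minimal single-process simulation of an allgather over p ranks. It is a sketch under stated assumptions, not the authors' implementation: it models one communication port per rank (the paper uses a flexible number of ports per node), involves no MPI calls, and the function name `bruck_allgather` is hypothetical. Blocks are treated as opaque values, so they may differ in size as in allgatherv.

```python
def bruck_allgather(blocks):
    """Simulate a single-port cyclic shift (Bruck) allgather.

    blocks[r] is the data initially owned by rank r.
    Returns (buffers, rounds): the final buffer of every rank in rank
    order, and the number of communication rounds used.
    """
    p = len(blocks)
    # Each rank starts with only its own block.
    buffers = [[b] for b in blocks]
    dist = 1
    rounds = 0
    while dist < p:
        # In the last round fewer than `dist` blocks may be missing.
        count = min(dist, p - dist)
        # Snapshot what every rank sends so all exchanges in a round
        # see the state from before the round.
        outgoing = [buf[:count] for buf in buffers]
        for rank in range(p):
            # Rank `rank` receives from the rank `dist` positions ahead
            # (and, symmetrically, sends to the rank `dist` behind).
            src = (rank + dist) % p
            buffers[rank].extend(outgoing[src])
        dist *= 2
        rounds += 1
    # Rank r now holds blocks r, r+1, ..., r+p-1 (mod p); a local
    # rotation puts them into rank order 0..p-1.
    ordered = [buf[p - r:] + buf[:p - r] for r, buf in enumerate(buffers)]
    return ordered, rounds


if __name__ == "__main__":
    result, rounds = bruck_allgather(["a", "bb", "ccc", "d", "ee"])
    print(rounds, result[0])
```

In this sketch the number of rounds grows logarithmically with the number of ranks (three rounds for five ranks), including when the rank count is not a power of two.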

Updated: 2020-06-24