MPI Collectives for Multi-core Clusters: Optimized Performance of the Hybrid MPI+MPI Parallel Codes
arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2020-07-14, DOI: arxiv-2007.06892
Huan Zhou and Jose Gracia and Ralf Schneider

The advent of multi-/many-core processors in clusters advocates hybrid parallel programming, which combines Message Passing Interface (MPI) for inter-node parallelism with a shared memory model for on-node parallelism. Compared to the traditional hybrid approach of MPI plus OpenMP, a new but promising hybrid approach of MPI plus MPI-3 shared-memory extensions (MPI+MPI) is gaining traction. We describe an algorithmic approach for collective operations (with allgather and broadcast as concrete examples) in the context of hybrid MPI+MPI, so as to minimize memory consumption and memory copies. With this approach, only one memory copy is maintained and shared by the on-node processes. This allows the removal of unnecessary on-node copies of replicated data that are required between MPI processes when the collectives are invoked in the context of pure MPI. We compare our collectives for hybrid MPI+MPI with the traditional ones for pure MPI, and also discuss the synchronization required to guarantee data integrity. The performance of our approach has been validated on a Cray XC40 system (Cray MPI) and a NEC cluster (OpenMPI), showing that it achieves comparable or better performance for allgather operations. We have further validated our approach with a standard computational kernel, namely distributed matrix multiplication, and with a Bayesian Probabilistic Matrix Factorization code.
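To make the idea concrete, the following is a minimal C sketch of the MPI-3 shared-memory machinery the abstract refers to: each node keeps a single shared copy of the gathered vector in a window created with MPI_Win_allocate_shared, on-node processes store into and load from that copy directly, and only one leader per node participates in the inter-node exchange. This is not the authors' implementation; it assumes one int of payload per process, equal numbers of processes per node, and a block rank mapping so that each node's contributions occupy a contiguous slice of the result.

```c
/*
 * Minimal sketch (not the paper's code) of a node-aware allgather in the
 * hybrid MPI+MPI style: one shared copy of the result per node, direct
 * load/store by on-node processes, inter-node exchange by node leaders only.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Per-node shared-memory communicator. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Node leaders (node_rank == 0) form the inter-node communicator. */
    MPI_Comm leader_comm = MPI_COMM_NULL;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    /* One shared window per node holds the complete allgather result;
       only the leader backs it with memory, the others map it. */
    MPI_Win win;
    int *result;
    MPI_Aint bytes = (node_rank == 0) ? (MPI_Aint)world_size * sizeof(int) : 0;
    MPI_Win_allocate_shared(bytes, sizeof(int), MPI_INFO_NULL,
                            node_comm, &result, &win);
    if (node_rank != 0) {
        MPI_Aint sz;
        int disp;
        MPI_Win_shared_query(win, 0, &sz, &disp, &result);
    }

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    /* Each process stores its contribution straight into the shared copy. */
    result[world_rank] = 100 + world_rank;   /* illustrative payload */

    /* Make all on-node stores visible before the inter-node step. */
    MPI_Win_sync(win);
    MPI_Barrier(node_comm);
    MPI_Win_sync(win);

    /* Only leaders exchange between nodes; with the assumed block mapping,
       each leader's slice already sits at leader_rank * node_size. */
    if (node_rank == 0) {
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      result, node_size, MPI_INT, leader_comm);
    }

    /* Publish the leaders' result to the on-node readers. */
    MPI_Win_sync(win);
    MPI_Barrier(node_comm);
    MPI_Win_sync(win);

    /* Every process can now read the full gathered vector from `result`. */
    if (world_rank == 0)
        printf("result[%d] = %d\n", world_size - 1, result[world_size - 1]);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

The two MPI_Win_sync/MPI_Barrier/MPI_Win_sync sequences illustrate the synchronization issue the abstract mentions: they order the on-node stores before the leaders' inter-node exchange and then make the exchanged data visible to all on-node readers, since with a single shared copy there is no per-process receive buffer to provide that ordering implicitly.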

Updated: 2020-07-15