Parallel Computing (IF 1.4), Pub Date: 2020-07-24, DOI: 10.1016/j.parco.2020.102669
Huan Zhou, José Gracia, Naweiluo Zhou, Ralf Schneider
The hybrid scheme that combines a message-passing programming model for inter-node parallelism with a shared-memory programming model for node-level parallelism is in widespread use. The extensive existing practice with hybrid Message Passing Interface (MPI) plus Open Multi-Processing (OpenMP) programming accounts for its popularity. Nevertheless, considerable programming effort is required to gain performance benefits from MPI+OpenMP code. An emerging hybrid method that combines MPI with the MPI shared-memory model (MPI+MPI) is promising. However, writing an efficient hybrid MPI+MPI program, especially when collective communication operations are involved, is far from trivial.
In this paper, we propose a new design method to implement collective communication operations in the hybrid MPI+MPI context. Our method avoids the on-node memory replication, and hence the on-node communication overhead, that the semantics of pure MPI require. We also offer wrapper primitives that hide all design details from users, together with guidance on how to structure hybrid MPI+MPI code with these primitives. Furthermore, we optimize the on-node synchronization scheme that our collectives require. Micro-benchmarks show that our collectives are comparable or superior to their counterparts in the pure MPI context. We have further validated the effectiveness of the hybrid MPI+MPI model (using our wrapper primitives) on three computational kernels, by comparison with the pure MPI and hybrid MPI+OpenMP models.
Collectives in hybrid MPI+MPI code: design, practice and performance