FALCON-X: Zero-copy MPI derived datatype processing on modern CPU and GPU architectures

https://doi.org/10.1016/j.jpdc.2020.05.008

Highlights

  • Identified the challenges involved in zero-copy MPI derived datatypes processing

  • Proposed novel designs to overcome semantic limitations and performance overheads

  • Demonstrated the efficacy of proposed designs on state-of-the-art CPU and GPU systems

  • Performance evaluation on diverse CPU and GPU architectures (e.g., DGX-2)

  • Achieved significant improvement over other MPI libraries on modern HPC hardware

Abstract

This paper addresses the challenges of MPI derived datatype processing and proposes FALCON-X — A Fast and Low-overhead Communication framework for optimized zero-copy intra-node derived datatype communication on emerging CPU/GPU architectures. We quantify various performance bottlenecks, such as memory layout translation and copy overheads for highly fragmented MPI datatypes, and propose novel pipelining and memoization-based designs to achieve efficient derived datatype communication. In addition, we propose enhancements to the MPI standard to address the semantic limitations of the existing datatype routines. Experimental evaluations show that our proposed designs significantly improve the intra-node communication latency and bandwidth over state-of-the-art MPI libraries on modern CPU and GPU systems. By using representative application kernels such as MILC, WRF, NAS_MG, Specfem3D, and Stencils on three different CPU architectures and two different GPU systems including DGX-2, we demonstrate up to 5.5x improvement on multi-core CPUs and 120x improvement on the DGX-2 GPU system over state-of-the-art designs in other MPI libraries.

Introduction

Modern High-Performance Computing (HPC) systems are enabling scientists from different research domains to explore, model, and simulate computation-heavy problems at different scales. The availability of multi- and many-core architectures (e.g., Intel Xeon, Xeon Phi, OpenPOWER, and NVIDIA Volta GPUs) has significantly accelerated the impact and capabilities of such large-scale systems. The current multi-petaflop systems are powered by such multi- and many-core CPUs and GPUs, and the adoption of these many-core architectures is expected to grow in future exascale systems [1]. The Message Passing Interface (MPI) [19] has been the de-facto programming model for developing high-performance parallel scientific applications for such systems, while Compute Unified Device Architecture (CUDA) is the primary programming interface for exploiting NVIDIA GPUs. The emergence of CUDA-aware MPI [35] has relieved application developers of manually moving data between host (CPU) and device (GPU) memories and of switching between the MPI and CUDA programming models for the communication phases of their applications. This decouples the two programming models within the applications: CUDA kernels are used for computation, while MPI drives the applications’ communication. The ubiquity of MPI as the de-facto programming model for modern CPU- and GPU-based systems mandates that MPI libraries be carefully designed to deliver the best possible performance for different communication primitives.
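
To make this decoupling concrete, the following minimal sketch (buffer name and message size are illustrative, not taken from the paper) passes a GPU-resident buffer directly to MPI point-to-point calls; with a CUDA-aware MPI library, no explicit staging of the data through host memory is required.

    /* Minimal sketch of CUDA-aware MPI: the device pointer is handed directly
     * to MPI_Send/MPI_Recv; the library handles the host/device data movement. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;                 /* illustrative message size */
        double *d_buf;
        cudaMalloc((void **)&d_buf, n * sizeof(double));

        if (rank == 0)
            MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }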

High-performance parallel algorithms and scientific applications often need to communicate non-contiguous data. For example, matrix-multiplication or halo-exchange often requires communicating one or multiple columns of large matrices stored in row-major format. To achieve this, the application can ‘pack’ the data into a temporary contiguous buffer and send it to the recipient process, which can then ‘unpack’ the data. However, this approach (known as “Manual Packing/Unpacking”) provides poor performance due to the multiple copies of the data and the increased memory footprint of the application. Researchers have shown that this packing/unpacking can take up to 90% of the total communication cost [29]. Moreover, this places the burden of managing these temporary buffers and manually copying the data on the application developer, leading to poor productivity.
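
The manual pack/unpack path can be illustrated with the column-exchange example above; the sketch below (matrix size and function names are illustrative) shows the temporary buffer and the two extra copies that this approach incurs.

    /* Manual pack/unpack of one column of a row-major N x N matrix
     * (illustrative sizes); the temporary buffer and the two copy loops
     * are exactly the overheads attributed to this approach. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 1024

    void send_column(const double *A, int col, int dst) {
        double *tmp = malloc(N * sizeof(double));   /* extra memory footprint */
        for (int i = 0; i < N; i++)                 /* copy #1: pack */
            tmp[i] = A[i * N + col];
        MPI_Send(tmp, N, MPI_DOUBLE, dst, 0, MPI_COMM_WORLD);
        free(tmp);
    }

    void recv_column(double *B, int col, int src) {
        double *tmp = malloc(N * sizeof(double));
        MPI_Recv(tmp, N, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < N; i++)                 /* copy #2: unpack */
            B[i * N + col] = tmp[i];
        free(tmp);
    }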

To address this, MPI provides a feature called Derived Datatypes (DDT) for communicating non-contiguous data in a portable and efficient manner. In this approach, the application composes a derived datatype from simple datatypes predefined by the MPI standard and uses this datatype in the communication primitives. However, state-of-the-art MPI libraries suffer from poor derived datatype processing performance, causing many applications such as WRF [38], MILC [18], NAS MG [21], and SPECFEM [32] to still rely on the manual pack/unpack method instead of using DDTs [29]. While researchers have proposed designs to improve the communication performance of DDTs on interconnects like InfiniBand [16], [27], [33], some of the fundamental bottlenecks in datatype processing, such as efficient translation from the datatype to the memory layout, remain unsolved. Furthermore, handling derived datatype communication for both CPU- and GPU-resident data brings forth several new challenges. For instance, MPI libraries still use shared-memory-based designs for intra-node datatype communication of CPU-resident data, which require multiple copies and offer poor performance and overlap. Similarly, state-of-the-art designs in CUDA-aware MPI implementations employ CUDA kernel-based solutions to accelerate the packing/unpacking phases for GPU-resident data [4], [30], [39]; however, these still suffer from significant synchronization overhead between the CPU and the GPU.
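
For comparison with the manual packing sketch above, the same column exchange can be expressed with a derived datatype: MPI_Type_vector describes the strided column layout once, and the MPI library handles the non-contiguous accesses (matrix size and function names remain illustrative).

    /* The same column exchange using an MPI derived datatype: MPI_Type_vector
     * describes N blocks of one double separated by a stride of N doubles, so
     * the application needs no temporary buffer or copy loop. */
    #include <mpi.h>

    #define N 1024

    void send_column_ddt(const double *A, int col, int dst) {
        MPI_Datatype column_t;
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column_t);  /* count, blocklength, stride */
        MPI_Type_commit(&column_t);
        MPI_Send(&A[col], 1, column_t, dst, 0, MPI_COMM_WORLD);
        MPI_Type_free(&column_t);
    }

    void recv_column_ddt(double *B, int col, int src) {
        MPI_Datatype column_t;
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column_t);
        MPI_Type_commit(&column_t);
        MPI_Recv(&B[col], 1, column_t, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Type_free(&column_t);
    }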

While zero-copy techniques for improving intra-node communication performance have been studied in depth [2], [3], [9], [11], [15], the trade-offs involved in using these techniques for non-contiguous communication have not been explored in the literature. In this work, we show that using zero-copy techniques for MPI datatypes exposes novel challenges in terms of correctness and performance, and propose efficient designs to address these issues for modern CPU and GPU systems. We also propose designs to reduce the layout translation overhead through memoization-based techniques. Finally, we show that current MPI datatype routines are not able to fully take advantage of the zero-copy semantics and propose enhancements to address these limitations.

Section snippets

Motivation

The poor performance of datatype-based communication in MPI libraries has been well documented in the literature [34], [40]. To understand the bottlenecks involved in datatype-based communication in MPI for CPU and GPU systems, we analyze the communication latency of one such transfer from various representative application kernels (WRF, MILC, NAS, and SPECFEM3D_CM) provided by DDTBench [28]. We have also modified DDTBench to support the evaluation of CUDA-aware MPI libraries.

Fig. 1 shows the

Contributions

These observations lead us to the following broad challenge: How can we design a high-performance and efficient zero-copy-based communication runtime for MPI derived datatypes on modern CPU/GPU systems? In our prior work [12], we discussed the challenges involved in CPU-based derived datatype processing. In this paper, we enhance our earlier work on CPUs and augment it further by proposing designs for GPU-based MPI derived datatype processing in CUDA-aware MPI libraries.

In this work, we

Designing zero-copy MPI datatype processing on modern CPUs

In this section, we look at the detailed designs for using zero-copy techniques for datatype-based intra-node communication, and propose mechanisms to improve the performance of such designs.
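
As a conceptual illustration of the zero-copy idea this section builds on, and not of the FALCON-X protocol itself, the sketch below assumes the standard XPMEM user-space API (xpmem_make, xpmem_get, xpmem_attach): the sender exposes the region holding its non-contiguous data, and the receiver attaches that region into its own address space and gathers the elements directly from the sender's memory, avoiding intermediate shared-memory staging. The flattened (offset, length) layout list and helper names are placeholders.

    /* Conceptual zero-copy sketch using an XPMEM-style remote-mapping API
     * (not the FALCON-X implementation; layout handling is simplified). */
    #include <xpmem.h>
    #include <stddef.h>
    #include <string.h>

    typedef struct { size_t offset, length; } seg_t;   /* flattened layout entry */

    /* Sender side: expose the region that contains the non-contiguous data. */
    xpmem_segid_t expose_region(void *base, size_t len) {
        return xpmem_make(base, len, XPMEM_PERMIT_MODE, (void *)0666);
    }

    /* Receiver side: attach the sender's region and gather each segment
     * directly from the sender's memory into a local contiguous buffer. */
    void zero_copy_gather(xpmem_segid_t segid, size_t region_len,
                          const seg_t *layout, int nsegs, char *dst) {
        xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR,
                                      XPMEM_PERMIT_MODE, (void *)0666);
        struct xpmem_addr addr = { .apid = apid, .offset = 0 };
        char *remote = (char *)xpmem_attach(addr, region_len, NULL);

        size_t out = 0;
        for (int i = 0; i < nsegs; i++) {       /* one direct copy per segment */
            memcpy(dst + out, remote + layout[i].offset, layout[i].length);
            out += layout[i].length;
        }
        xpmem_detach(remote);
        xpmem_release(apid);
    }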

Designing zero-copy datatype processing on multi-GPU systems

In modern multi-GPU systems, high-performance interconnects such as PCIe and NVLink are widely used to connect GPUs. These interconnects enable peer-to-peer (P2P) access either through the driver APIs (i.e., via copy calls) or from within a compute kernel (i.e., via direct load-store operations). In this section, we elaborate on the proposed MPI-level solutions to address the challenges of leveraging the P2P feature for achieving zero-copy movement of non-contiguous GPU-resident data. We implemented
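
A hedged sketch of the load-store flavor of P2P access mentioned above: once peer access is enabled, a kernel running on one GPU can read a strided column directly from a buffer resident on a peer GPU and write it out contiguously. The example assumes a single process driving two GPUs; in a multi-process MPI setting the peer buffer would typically be shared via CUDA IPC handles (cudaIpcGetMemHandle/cudaIpcOpenMemHandle), which is omitted here, and none of this reflects the paper's internal design.

    /* Sketch of kernel-driven P2P access: GPU 0 reads directly from GPU 1's
     * memory via load-store once peer access has been enabled. */
    #include <cuda_runtime.h>

    __global__ void gather_column(const double *peer_src, double *dst,
                                  int rows, int stride, int col) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < rows)
            dst[i] = peer_src[i * stride + col];   /* direct load from peer GPU */
    }

    void p2p_gather(const double *src_on_gpu1, double *dst_on_gpu0,
                    int rows, int stride, int col) {
        int can_access = 0;
        cudaDeviceCanAccessPeer(&can_access, 0, 1);
        if (!can_access) return;                   /* no P2P path available */

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);          /* map GPU 1 into GPU 0's context */
        gather_column<<<(rows + 255) / 256, 256>>>(src_on_gpu1, dst_on_gpu0,
                                                   rows, stride, col);
        cudaDeviceSynchronize();
    }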

Efficient layout translation designs for CPU/GPU systems

The designs proposed in Section 4 focus on reducing the cost of exchanging the sender’s layout. However, they still involve the layout translation from the local datatypes on the receiver process. As shown in Fig. 5, this datatype to memory layout translation can be very costly for nested datatypes due to the recursive nature of the datatype parsing. To mitigate this overhead, we propose two approaches that eliminate the layout translation overheads for both the sender and the receiver
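
As an illustration of the memoization idea (an assumption-laden sketch, not the paper's data structure), the flattened (offset, length) list produced by the recursive translation can be cached against the datatype handle so that repeated communication with the same committed type skips the recursive parse; flatten_datatype below is a placeholder for that translation routine, and a real implementation would also invalidate entries when a datatype is freed.

    /* Sketch of memoizing datatype-to-layout translation: the flattened
     * (offset, length) list is cached against the datatype handle, so
     * subsequent transfers with the same committed type reuse it instead
     * of re-walking the type tree. */
    #include <mpi.h>
    #include <stdlib.h>

    typedef struct { size_t offset, length; } segment_t;

    typedef struct layout_entry {
        MPI_Datatype         dtype;      /* key: committed datatype handle */
        segment_t           *segs;       /* value: flattened memory layout */
        int                  nsegs;
        struct layout_entry *next;
    } layout_entry_t;

    static layout_entry_t *layout_cache = NULL;

    /* Recursive translation (the expensive step); placeholder declaration. */
    extern int flatten_datatype(MPI_Datatype dtype, segment_t **segs);

    const layout_entry_t *get_layout(MPI_Datatype dtype) {
        for (layout_entry_t *e = layout_cache; e; e = e->next)
            if (e->dtype == dtype)
                return e;                          /* hit: skip translation */

        layout_entry_t *e = malloc(sizeof(*e));    /* miss: translate once, cache */
        e->dtype = dtype;
        e->nsegs = flatten_datatype(dtype, &e->segs);
        e->next  = layout_cache;
        layout_cache = e;
        return e;
    }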

Experimental evaluation

We used four production MPI libraries — MVAPICH2-X v2.3rc1, MVAPICH2-GDR 2.3.2, Intel MPI (IMPI) v2018.1.163 and v2019.0.045, and Open MPI v3.1.2 with UCX v1.3.1. MVAPICH2-X and Open MPI+UCX were configured to use XPMEM as the intra-node transport mechanism. Our early experiments showed IMPI 2018 performing better than IMPI 2019 for certain benchmarks. We attribute this to the lack of optimizations for derived datatypes in libfabric. Thus, we present the results for both versions of

Related work

Researchers have explored network features to improve the performance of MPI datatype processing. Santhanaraman et al. [27] leveraged the Scatter/Gather List (SGL) feature to propose a zero-copy based scheme called SGRS. Li et al. [16] exploited the User-mode Memory Registration (UMR) feature of InfiniBand to remove the packing/unpacking overhead on the sender and receiver sides, which led to better performance. Their design also had lower memory utilization as it avoided the need for the

Conclusion and future work

In this paper, we identified the challenges involved in designing intra-node zero-copy communication schemes for MPI derived datatypes on modern multi-/many-core CPU and GPU architectures and proposed designs to address them efficiently. The proposed solutions, referred to as FALCON-X, reduced the cost of layout translation and exchange using novel designs based on pipelining and memoization. Finally, we proposed enhancements to the MPI datatype creation semantics to enable future avenues for

CRediT authorship contribution statement

Jahanzeb Maqbool Hashmi: Conceptualization, Investigation, Methodology, Writing - original draft. Ching-Hsiang Chu: Conceptualization, Investigation. Sourav Chakraborty: Methodology, Writing - review & editing. Mohammadreza Bayatpour: Data curation, Methodology. Hari Subramoni: Supervision. Dhabaleswar K. Panda: Supervision, Validation, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research is supported in part by NSF grants #ACI-2007991, #CNS-1513120, #ACI-1450440, #CCF-1565414, #ACI-1664137, and #ACI-1931537. The authors would like to thank Dr. Sadaf Alam and Dr. Carlos Osuna for providing access to the CSCS testbed.

References (40)

  • B. Goglin, et al., KNEM: A generic and scalable kernel-assisted intra-node MPI communication framework, J. Parallel Distrib. Comput. (2013)
  • Aurora supercomputer,...
  • M. Bayatpour, et al., SALaR: Scalable and adaptive designs for large message reduction collectives
  • S. Chakraborty, H. Subramoni, D. Panda, Contention aware kernel-assisted MPI collectives for multi/many-core systems,...
  • C.-H. Chu, K. Hamidouche, A. Venkatesh, D.S. Banerjee, H. Subramoni, D.K. Panda, Exploiting maximal overlap for...
  • C.-H. Chu, et al., High-performance adaptive MPI derived datatype communication for modern multi-GPU systems
  • A. Friedley, et al., Hybrid MPI: Efficient message passing for multi-core systems
  • A. Friedley, et al., Ownership passing: Efficient distributed memory programming on multi-core systems
  • R. Ganian, et al., Polynomial-time construction of optimal MPI derived datatype trees
  • W. Gropp, et al., Improving the performance of MPI derived datatypes
  • J.M. Hashmi, et al., Designing efficient shared address space reduction collectives for multi-/many-cores
  • J.M. Hashmi, et al., FALCON: Efficient designs for zero-copy MPI datatype processing on emerging architectures
  • J. Jenkins, et al., Processing MPI derived datatypes on noncontiguous GPU-resident data, IEEE Trans. Parallel Distrib. Syst. (2014)
  • J. Jenkins, J. Dinan, P. Balaji, N.F. Samatova, R. Thakur, Enabling fast, noncontiguous GPU data movement in hybrid MPI...
  • H.-W. Jin, et al., LiMIC: Support for high-performance MPI intra-node communication on Linux cluster
  • M. Li, et al., High performance MPI datatype support with user-mode memory registration: Challenges, designs, and benefits
  • Linux Kernel, Cross memory attach, https://lwn.net/Articles/405284/. (Online; Accessed April 21,...
  • MIMD lattice computation (MILC), http://physics.indiana.edu/~sg/milc.html. (Online; Accessed April 21,...
  • Message Passing Interface Forum, MPI: A message-passing interface standard,...
  • MVAPICH2: MPI over InfiniBand, 10GigE/iWARP and RoCE, https://mvapich.cse.ohio-state.edu/. (Online; Accessed April 21,...

    Jahanzeb Maqbool Hashmi is a Ph.D. candidate at The Ohio State University where he works at Network-Based Computing Laboratory (NBCL). Before joining NBCL, he was a graduate fellow at the Department of Computer Science and Engineering, Ohio State University. His research is mainly targeted at enabling high-level programming models and runtime systems to achieve high-performance and scalability on dense CPU and GPU architectures. He works on the areas related to shared-memory and shared-address-space communication, topology-aware communication protocols, high-performance deep learning, and optimizations for MPI, PGAS, and MPI+X programming models on modern CPU, accelerators, and interconnects. Prior to joining OSU, he completed his MS in computer engineering at Ajou University, South Korea under the Korean Global IT fellowship where he worked on the performance evaluation and characterization of energy-efficient clusters for scientific workloads. He received his BS from National University of Science and Technology, Pakistan under the Prime Minister’s ICT fellowship.

    Ching-Hsiang Chu is a Ph.D. candidate in Computer Science and Engineering at The Ohio State University, Columbus, Ohio, U.S.A. He received B.S. and M.S. degrees in Computer Science and Information Engineering from National Changhua University of Education, Taiwan in 2010 and National Central University, Taiwan in 2012, respectively. His research interests include high-performance computing, GPU communication, and wireless networks. He is a student member of IEEE and ACM. More details are available at http://web.cse.ohio-state.edu/~chu.368.

    Sourav Chakraborty graduated with a Ph.D. from The Ohio State University, where he worked at the Network-Based Computing Laboratory (NBCL). His research interests include high-performance computing, MPI and PGAS programming models, kernel-assisted collective communication, and designing MPI protocols for emerging architectures. He has made contributions to the MVAPICH2 and MVAPICH2-X projects, which are used by the wider HPC community.

    Mohammadreza Bayatpour is a fifth-year Ph.D. student in the Computer Science and Engineering Department at The Ohio State University. His research interests are High-Performance Networking and Computing, Scalable Distributed Systems, Parallel Programming Models, and In-Network Computing.

    Hari Subramoni received the Ph.D. degree in Computer Science from The Ohio State University, Columbus, OH, in 2013. He has been a research scientist in the Department of Computer Science and Engineering at The Ohio State University since September 2015. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, and cloud computing. He has published over 50 papers in international journals and conferences related to these research areas. Dr. Subramoni is currently working on the design and development of the MVAPICH2, MVAPICH2-GDR, and MVAPICH2-X software packages. He is a member of IEEE. More details about Dr. Subramoni are available from http://www.cse.ohio-state.edu/~subramon.

    Dhabaleswar K. Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at The Ohio State University. He has published over 450 papers in major journals and international conferences. The MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) open-source software package, developed by his research group (http://mvapich.cse.ohio-state.edu), is currently being used by more than 3,075 organizations worldwide (in 89 countries). This software has enabled several InfiniBand clusters to get into the latest TOP500 ranking during the last decade (including the current #3). More than 756,000 downloads of this software have taken place from the project’s website alone. He is an IEEE Fellow and a member of ACM. More details about him are available from http://web.cse.ohio-state.edu/~panda.2/.
