Efficient OpenMP parallelization to a complex MPI parallel magnetohydrodynamics code

https://doi.org/10.1016/j.jpdc.2020.02.004

Highlights

  • OpenMP+MPI parallelization strategy is implemented for a complex MHD code.

  • Efficient scaling for MHD is obtained up to 0.5 million cores.

  • Coarse-grained multi-threading performance depends on the compiler's capabilities.

Abstract

The state-of-the-art finite volume/difference magnetohydrodynamics (MHD) code Block Adaptive Tree Solarwind Roe Upwind Scheme (BATS-R-US) was originally designed with pure MPI parallelization. The maximum achievable problem size was limited by the storage requirements of the block tree structure. To mitigate this limitation, we have added multi-threaded OpenMP parallelization to the previous pure MPI implementation. We opted for a coarse-grained approach, making the loops over grid blocks multi-threaded, and have succeeded in making BATS-R-US an efficient hybrid parallel code with modest changes to the source code while preserving performance. Good weak scaling up to hundreds of thousands of cores was achieved for both explicit and implicit time stepping schemes. This parallelization strategy greatly extended the possible simulation scale from 16,000 cores to more than 500,000 cores with 2 GB/core memory on the Blue Waters supercomputer. Our work also revealed significant performance issues with some compilers when the code is compiled with the OpenMP library, probably related to less efficient optimization of a complex multi-threaded region.

Introduction

The Block-Adaptive-Tree Solarwind Roe Upwind Scheme (BATS-R-US) [7], [19] is a multi-physics MHD code written in Fortran 90+ that has been actively developed at the University of Michigan for over 20 years. It is the most complex and often the computationally most expensive model in the Space Weather Modeling Framework (SWMF) [1], [29], [30], which has been applied to simulate multi-scale space physics systems including, but not limited to, the solar corona, the heliosphere, planetary magnetospheres, moons, comets and the outer heliosphere. For the purpose of adaptive mesh refinement (AMR) and runtime efficiency, the code was designed from the very beginning to use a 3D Cartesian block-adaptive mesh with MPI parallelization [7], [26]. In 2012, the original block-adaptive implementation was replaced with the newly designed and implemented Block Adaptive Tree Library (BATL) [29] for creating, adapting, load-balancing and message-passing a 1, 2, or 3 dimensional block-adaptive grid in generalized coordinates. The major advantages of the adaptive block approach include a locally structured grid in each block, cache optimization due to the relatively small arrays associated with the grid blocks, loop optimization for fixed-size loops over the cells of a block, and simple load balancing. Larger blocks reduce the total number of ghost cells surrounding the grid blocks, but make grid adaptivity less economic. Smaller blocks allow precise grid adaptation, but require a large number of blocks and more storage and computation spent on ghost cells. The typical choice of block size in 3D ranges from 4³ to 16³ grid cells, with an additional 1–3 layers of ghost cells on each side depending on the order of the numerical scheme.
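
To make this block layout concrete, the sketch below shows one possible Fortran declaration of block-based, cell-centered storage with ghost cell layers. The module, parameter and array names are chosen for illustration and are not necessarily the identifiers used in BATS-R-US.

    ! Minimal sketch of block-based storage with ghost cells (illustrative
    ! names, not necessarily the identifiers used in BATS-R-US).
    module ModBlockSketch
      implicit none
      integer, parameter :: nI = 8, nJ = 8, nK = 8  ! cells per block per direction (typically 4..16)
      integer, parameter :: nG = 2                  ! ghost cell layers on each side
      integer, parameter :: nVar = 8                ! e.g. ideal MHD: density, momentum, B, pressure
      integer, parameter :: MaxBlock = 1000         ! blocks stored on this MPI process

      ! Cell-centered state: physical cells 1..nI plus nG ghost layers per side.
      real, allocatable :: State_VGB(:,:,:,:,:)
    contains
      subroutine init_blocks
        allocate(State_VGB(nVar, 1-nG:nI+nG, 1-nG:nJ+nG, 1-nG:nK+nG, MaxBlock))
        State_VGB = 0.0
      end subroutine init_blocks
    end module ModBlockSketch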

BATS-R-US has been gradually evolving into a comprehensive code by adding new schemes as well as new physical models. Currently, 60 equation sets, from ideal hydrodynamics to the most recent six-moment fluid model [11], are available. The most important applications solve various forms of the magnetohydrodynamic (MHD) equations, including resistive, Hall, semi-relativistic, multi-species and multi-fluid MHD, optionally with anisotropic pressure, radiative transport and heat conduction. There are several choices of numerical schemes for the Riemann solvers, from the original Roe scheme to many others, combined with a second-order total variation diminishing (TVD) scheme or a fifth-order accurate conservative finite difference scheme [5]. The time discretization can be explicit, point-implicit, semi-implicit, explicit/implicit or fully implicit. A high level abstraction of the code structure is presented in Fig. 1.
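
As a flavor of the second-order TVD ingredient mentioned above, the following is a textbook minmod slope limiter in Fortran. This is a generic sketch of the technique, not the BATS-R-US implementation.

    ! Generic minmod-limited slope for second-order TVD reconstruction
    ! (textbook illustration, not the BATS-R-US limiter).
    real function minmod_slope(qLeft, qCenter, qRight)
      implicit none
      real, intent(in) :: qLeft, qCenter, qRight
      real :: dL, dR
      dL = qCenter - qLeft
      dR = qRight - qCenter
      if (dL*dR <= 0.0) then
         minmod_slope = 0.0                          ! local extremum: zero slope
      else
         minmod_slope = sign(min(abs(dL), abs(dR)), dL)
      end if
    end function minmod_slope

The limited slope provides face values qCenter ± 0.5*slope whose reconstruction does not increase the total variation of the solution.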

A powerful feature of BATS-R-US is the incorporation of user modules. These provide an interface through which users can modify essentially any part of the kernel code without interfering with other modules, offering a clean and easy way to gain high-level control of a simulation. Currently there are 51 different user modules in the repository, mostly used to set up specific initial and boundary conditions and additional user-defined source terms for specific applications.

One of the key features of BATS-R-US is its excellent scalability on supercomputers. Previous benchmarks [29] with pure MPI parallelization have shown good strong scaling up to 8192 cores and weak scaling up to 16,384 cores within the memory limit of the testing platform. However, the pre-calculated grid-related information, replicated on every MPI process to simplify the refinement algorithm, creates an unavoidable memory redundancy on the computational nodes. To increase the scalability to even larger sizes, we need to reorganize the code and come up with a more advanced solution.

In this work, we have extended BATS-R-US with a hybrid MPI + OpenMP parallelization that significantly mitigates the limitations due to available memory. The strategies and issues are described in the next two sections, followed by performance test results and discussions.

Section snippets

Hybrid parallelization strategy

BATS-R-US was originally designed for pure MPI parallelization and did not take advantage of the rapid development of shared-memory multi-threaded programming starting in the late 1990s [4], [6]. Even though MPI is generally observed to give better parallel scaling than OpenMP due to forced data locality, one obvious shortcoming of the pure MPI implementation is wasteful memory usage. In BATS-R-US, we support 1, 2, and 3 dimensional block-adaptive grids, where each block contains n1 × n2 × n3 cells.
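
The coarse-grained strategy makes the outer loop over grid blocks multi-threaded, so each thread advances entire blocks rather than sharing the cell loops inside a block. Below is a minimal sketch of such a threaded block loop, reusing the hypothetical block storage module sketched in the Introduction; the loop body is a placeholder, not the actual BATS-R-US source.

    ! Coarse-grained OpenMP: the loop over blocks is threaded, so each thread
    ! advances whole blocks and writes only its own block's slice of the state.
    ! (Illustrative sketch, not the actual BATS-R-US code.)
    subroutine advance_blocks(nBlock)
      use ModBlockSketch, only: State_VGB
      implicit none
      integer, intent(in) :: nBlock
      integer :: iBlock

      !$omp parallel do schedule(dynamic)
      do iBlock = 1, nBlock
         call update_block(iBlock)
      end do
      !$omp end parallel do

    contains

      subroutine update_block(iBlock)
        integer, intent(in) :: iBlock
        ! Placeholder for the real per-block work (face fluxes, state update);
        ! a real scheme writes only State_VGB(:,:,:,:,iBlock).
        State_VGB(:,:,:,:,iBlock) = State_VGB(:,:,:,:,iBlock)
      end subroutine update_block

    end subroutine advance_blocks

Because different blocks own disjoint slices of the state array, the block loop parallelizes without locking, and dynamic scheduling helps balance blocks of unequal cost.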

Overview

There are two high-level goals while modifying and improving the code:

  1. Backward compatibility: the code should still work correctly and efficiently without the OpenMP compilation flag.

  2. Minimize work effort and code changes as much as possible.

BATS-R-US is able to solve the system of partial differential equations with a mixture of explicit and implicit time stepping blocks distributed among the MPI processes. We treat first the explicit and then the implicit modules and add OpenMP directives
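
Two OpenMP features are commonly useful in this kind of coarse-grained retrofit: module variables holding per-block scratch data can be declared threadprivate so the block loop stays thread-safe, and the conditional compilation sentinel (!$) compiles to an ordinary comment without the OpenMP flag, which helps with goal 1. A minimal sketch under these assumptions (illustrative names, not necessarily how BATS-R-US handles it):

    module ModFaceFluxSketch
      implicit none
      ! Per-block scratch array kept as a module variable; with coarse-grained
      ! threading each thread needs its own copy.
      real, allocatable :: Flux_VX(:,:,:,:)
      !$omp threadprivate(Flux_VX)

    contains

      subroutine init_face_flux(nVar, nI, nJ, nK)
        integer, intent(in) :: nVar, nI, nJ, nK
        ! The "!$omp" lines are active only when OpenMP is enabled; without the
        ! OpenMP flag they are plain comments and the code stays pure MPI.
        !$omp parallel
        if (.not. allocated(Flux_VX)) allocate(Flux_VX(nVar, nI+1, nJ, nK))
        !$omp end parallel
      end subroutine init_face_flux

    end module ModFaceFluxSketch

With this pattern, each OpenMP thread allocates and reuses its own copy of the scratch array, while a build without the OpenMP flag sees only standard Fortran.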

Nightly tests

For a comprehensive quality check and verification of the SWMF and the physics models it contains (including BATS-R-US), we have built an automated nightly test suite that exercises the code with a variety of setups. The latest version of the code is checked out from a central Git repository and 100+ tests are performed on multiple platforms with different compilers, compiler flags and numbers of cores. The test results are monitored every day and have been archived since 2009. These

Performance

We have performed standard Brio–Wu MHD shock tube tests on various platforms. A 3D Cartesian grid is used, and dynamic AMR is not employed. The numerical magnetic monopole error (the divergence of the magnetic field) is controlled by hyperbolic cleaning [29] and the 8-wave scheme [19]. Despite its simplicity, this test is fairly representative of various applications in terms of computational cost per grid cell, and it exercises the most important parts of the BATS-R-US code.
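
For orientation, hyperbolic divergence cleaning in the Dedner (generalized Lagrange multiplier) style couples the induction equation to an extra scalar field ψ that propagates and damps the divergence error. A generic form of the technique (not necessarily the exact formulation of [29]) is

    \partial_t \mathbf{B} + \nabla\cdot(\mathbf{u}\,\mathbf{B} - \mathbf{B}\,\mathbf{u}) + \nabla\psi = 0,
    \qquad
    \partial_t \psi + c_h^2\,\nabla\cdot\mathbf{B} = -\frac{c_h^2}{c_p^2}\,\psi,

where c_h sets the speed at which the divergence error is carried away and c_p controls how quickly it is damped; the 8-wave scheme [19] instead adds source terms proportional to \nabla\cdot\mathbf{B} to the MHD equations.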

Conclusion

In this work, we have successfully extended our finite volume/difference MHD code BATS-R-US from a pure MPI to a hybrid MPI + OpenMP implementation, with only 0.25% of the 250,000 lines of source code modified. Good weak scaling is obtained up to 500,000 cores with explicit time stepping and up to 250,000 cores with implicit time stepping. Using the hybrid parallelization, we are now able to solve problems more than an order of magnitude larger than before thanks to the usage

CRediT authorship contribution statement

Hongyang Zhou: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Visualization. Gábor Tóth: Methodology, Resources, Supervision, Funding acquisition.

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.jpdc.2020.02.004.

Acknowledgments

The authors are thankful for the useful comments and suggestions by the reviewers. This research was supported by NSF INSPIRE (United States) grant number PHY-1513379. The computational resources were funded by Blue Waters GLCPC (United States) and NSF Frontera (United States).

Hongyang Zhou, Research Assistant, is a PhD student at the University of Michigan, Ann Arbor. He has been working on coupled MHD-EPIC simulations of Ganymede's magnetosphere. He is also interested in high-performance plasma simulations using various techniques.

References (34)

  • R. Chandra, et al., Parallel Programming in OpenMP, 2001.

  • L. Dagum, et al., OpenMP: An industry-standard API for shared-memory programming, Comput. Sci. Eng., 1998.

  • D.L. De Zeeuw, et al., An adaptive MHD method for global space weather simulations, IEEE Trans. Plasma Sci., 2000.

  • Dinesh K. Kaushik, W.D.G., et al., Using memory performance to understand the mixed MPI/OpenMP model, 2019.

  • N. Drosinos, et al., Performance comparison of pure MPI vs hybrid MPI-OpenMP parallelization models on SMP clusters.

  • Z. Huang, G. Tóth, B. van der Holst, Y. Chen, T. Gombosi, A six-moment multi-fluid plasma model, J. Comput....

  • G. Jost, H.-Q. Jin, D. an Mey, F.F. Hatay, Comparing the OpenMP, MPI, and hybrid programming paradigm on an SMP...

Dr. Gábor Tóth, Research Professor, is an expert in algorithm and code development for space and plasma physics simulations. He has a leading role in the development of the Space Weather Modeling Framework, which can couple and execute about a dozen different space physics models covering domains from the surface of the Sun to the upper atmosphere of the Earth. He is one of the main developers of the BATS-R-US code, a multi-physics and multi-application MHD code using block-adaptive grids. He participated in designing the software architecture for the Center for Radiative Shock Hydrodynamics. He also designed the Versatile Advection Code, a general purpose, publicly available hydrodynamics and MHD code.
