Modeling the effect of application-specific program transformations on energy and performance improvements of parallel ODE solvers

https://doi.org/10.1016/j.jocs.2021.101356

Highlights

  • The article investigates multithreaded solution methods for ODEs with a focus on their performance and energy behavior.

  • Application-specific program transformations used to modify the memory access behavior of the ODE solvers are investigated.

  • A detailed investigation of the resulting performance and energy consumption is presented for five different multicore systems.

  • The investigation considers different aspects, such as the ODE problem to be solved, the number of threads used, or the usage of frequency scaling.

  • An analytical power and energy model is used for capturing the performance and energy consumption.

Abstract

Ordinary differential equations (ODEs) are important for modelling many problems from science and engineering, and efficient ODE solvers are required, for example when solving time-dependent partial differential equations (PDEs) with the method of lines. Since an ODE solver may perform a large number of iteration steps, the execution time for solving an ODE problem can be quite large. Thus, a reduction of the execution time is desirable and should affect each iteration step of the simulation. Programming techniques to reduce the execution time of ODE solvers are parallelism and modification of the memory access structure such that the memory access time decreases. In this article, we investigate multithreaded solution methods for ODEs with different memory access behavior and their influence on the performance. Additionally, the energy consumption is considered. The parallelism is implemented as a shared-memory program for multicore processors. The memory access behavior is investigated using different program variants which result from application-specific program transformations that change the memory access order while guaranteeing numerical correctness. For the investigation of the performance, experimental data have been gathered on five different recent multicore processors. Additionally, an analytical power and energy model for modeling the performance and energy consumption is introduced. As ODE solver, the popular class of embedded Runge-Kutta methods with error correction is used. The simulation problems are two different ODEs resulting from discretized PDEs. The experimental data give insight into the quite diverse performance behavior of the ODE solver variants solving the same problem on different platforms.

Introduction

Power and energy awareness are important aspects for the design of modern processors. Several features, such as frequency scaling or power capping, have been developed and are integrated in power management units (PMUs) to control the power and energy consumption at hardware level. Recent processors have several performance states (P-states) and CPU operating states (C-states) which allow the hardware to react to different workloads or to turn off unused components to save power. Dynamic voltage and frequency scaling (DVFS) is supported by most processors to reduce the energy consumption.

Besides the importance of hardware mechanisms to control the energy consumption, software aspects are also of high interest, since the software behavior may have a large influence on the resulting energy consumption and execution time. Thus, it is essential to design software such that the resulting energy consumption meets the given requirements, e.g., be as small as possible. Program transformations may play an important role in this context, since they may lead to program versions with a different performance, power or energy consumption. A crucial step towards designing energy-aware software using program transformations is the ability to assess the energy and performance effects of specific program transformations. A final goal is to quantitatively understand which transformation has which effect on the resulting energy and performance behavior. In this article, we contribute to this goal in the context of ODE solvers. In particular, we apply program transformations to create new program versions and explore the effects of several application-specific transformations on the performance and energy consumption of the new program versions on several recent multicore processors. The transformations applied are difficult to find by a compiler due to the complex program structure with intertwined loops and function calls. We also show that power and energy models can help to get insight into the observed performance and energy behavior.

The solution of ordinary differential equations (ODEs) is an important area in scientific computing and, thus, the performance and energy consumption of ODE solvers are of great interest. In this article, we investigate embedded Runge-Kutta (RK) methods, which are popular ODE solvers for a broad range of application problems, including discretized time-dependent PDE problems, which often result in large systems of ODEs [20]. Since embedded RK methods are one-step methods that may perform a large number of iteration steps, each with several evaluations of the right-hand side function of the ODE system, it is crucial to design each iteration step such that it is efficient in terms of execution time and energy consumption.
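To make the structure of an embedded RK iteration step concrete, the following minimal sketch uses the Heun-Euler pair (orders 2 and 1) as a stand-in. It is not the method or the code investigated in the article; the function names `heun_euler_step` and `integrate`, the tolerance, and the step size controller constants are assumptions chosen for illustration only.

```python
import math

def heun_euler_step(f, t, y, h):
    """One step of the embedded Heun-Euler pair (orders 2 and 1).

    Returns the second-order solution and an error estimate given by
    the difference to the first-order (explicit Euler) solution.
    """
    k1 = f(t, y)                                          # first function evaluation
    k2 = f(t + h, [yi + h * k for yi, k in zip(y, k1)])   # second function evaluation
    y_high = [yi + 0.5 * h * (a + b) for yi, a, b in zip(y, k1, k2)]
    y_low = [yi + h * a for yi, a in zip(y, k1)]
    err = max(abs(a - b) for a, b in zip(y_high, y_low))
    return y_high, err

def integrate(f, t0, y0, t_end, h=0.1, tol=1e-6):
    """Adaptive time stepping: a step is accepted only if err <= tol."""
    t, y = t0, list(y0)
    steps = 0
    while t < t_end:
        h = min(h, t_end - t)            # do not step past the end of the interval
        y_new, err = heun_euler_step(f, t, y, h)
        if err <= tol:                   # accept the step
            t, y = t + h, y_new
            steps += 1
        # Step size update for an O(h^2) error estimate, with a safety
        # factor of 0.9 and growth/shrink limits (illustrative constants).
        h *= min(2.0, max(0.2, 0.9 * math.sqrt(tol / max(err, 1e-16))))
    return y, steps

# Usage: y' = -y with y(0) = 1, integrated to t = 1.
y, steps = integrate(lambda t, y: [-y[0]], 0.0, [1.0], 1.0)
```

Every accepted or rejected step costs two evaluations of the right-hand side here, which illustrates why the cost per iteration step dominates the overall execution time for large ODE systems.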

To provide a suitable basis for the investigation of the effect of program transformations, we have developed several implementation versions for embedded RK methods resulting from a series of consecutive application-specific program transformations. The final RK version implements an RK method with delayed function evaluations of the right-hand side of the ODE system. For each solver version, we have developed a multithreaded implementation which exploits the size of the ODE system for execution on multicore processors in each iteration. However, synchronization is needed between iterations due to data dependencies. The interaction of the multithreaded implementations and the loop transformations leads to further requirements for synchronization points within the individual iterations, which are necessary to guarantee numerical correctness. This issue is discussed together with the program transformations.
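The need for synchronization points can be illustrated with a small sketch. It is not the authors' implementation: it uses explicit Euler steps instead of an embedded RK method, a 1D heat equation discretized by the method of lines as a hypothetical test problem, and Python threads, which show the synchronization structure but give no speedup in standard CPython. The names `rhs` and `solve_parallel` are assumptions.

```python
import threading

def rhs(y, i):
    """Component i of the right-hand side for a 1D heat equation
    discretized by the method of lines (zero boundary values)."""
    left = y[i - 1] if i > 0 else 0.0
    right = y[i + 1] if i < len(y) - 1 else 0.0
    return left - 2.0 * y[i] + right

def solve_parallel(y0, h, n_steps, n_threads):
    n = len(y0)
    y, y_new = list(y0), [0.0] * n
    barrier = threading.Barrier(n_threads)

    def worker(tid):
        # Each thread owns a contiguous block of ODE system components.
        lo = tid * n // n_threads
        hi = (tid + 1) * n // n_threads
        for _ in range(n_steps):
            for i in range(lo, hi):
                # Reads old values, including those owned by neighbour threads.
                y_new[i] = y[i] + h * rhs(y, i)
            barrier.wait()   # all reads of y are done before y is overwritten
            for i in range(lo, hi):
                y[i] = y_new[i]
            barrier.wait()   # all of y is updated before the next time step

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return y

# Usage: the result must not depend on the number of threads.
y0 = [0.0] * 16
y0[8] = 1.0
assert solve_parallel(y0, 0.1, 20, 1) == solve_parallel(y0, 0.1, 20, 4)
```

Removing either barrier would allow a thread to read a neighbour's component from the wrong time step, which is the kind of data dependency that any reordering of the loops has to respect.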

The contributions of this work are in the area of application-specific program transformations and the investigation of their effect on the resulting performance, power, and energy consumption of ODE solvers for five different multicore processors. Several influencing factors are analyzed in detail:

  • The effect of the computational demands and the memory access characteristics of the specific ODE problem to be solved: Two ODE problems with different characteristics are considered and used as test cases and an experimental evaluation on different desktop and server processors is performed.

  • The effect of the number of threads used to solve the ODE problems: It is shown that the program transformations may have different effects for a varying number of threads. The resulting performance scalability and energy consumption depend strongly on the computational characteristics of the specific ODE problem to be solved.

  • The effect of frequency scaling using DVFS on the resulting performance: For different operational frequencies, different performance, power, and energy effects can be observed for the ODE solver versions resulting from the program transformations.

  • The effect of the processor architecture: The experimental evaluation is performed on five multicore systems with different numbers of cores. It is shown that different multicore systems may have quite different power consumption, resulting in large differences in the overall energy consumption.

The rest of the paper is structured as follows. Section 2 describes the different multithreaded implementation versions of ODE solution methods. Section 3 introduces the power and energy model used for DVFS. Section 4 contains the experimental evaluation on different hardware systems. Section 5 uses the models from Section 3 for a modeling of execution time and energy consumption. Section 6 discusses related work. Section 7 gives some concluding remarks.

Section snippets

Solution methods for ODEs

The program transformations are applied to the explicit RK method, which is briefly summarized in Section 2.1. The series of transformations is described in Section 2.2 and the resulting multithreaded implementations are given in Section 2.3.

Power and energy model

A power and energy model can help to identify the crucial factors of the power and energy consumption. After giving the terminology in Section 3.1, the modeling of the power and energy consumption using DVFS is briefly described in Section 3.2.
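For orientation, a commonly used textbook formulation of power and energy under DVFS is sketched below. It is not taken from this excerpt, and the model in Section 3.2 may differ. The symbols are assumptions: $s \ge 1$ is a frequency scaling factor, $\alpha$ the switching activity, $C_L$ the load capacitance, $V$ the supply voltage, and $f$ the operational frequency. The sketch further assumes that the voltage scales linearly with the frequency, that the static power does not depend on $s$, and that the code is compute-bound so that the execution time scales with $s$.

```latex
\begin{align*}
P_{\mathrm{dyn}} &= \alpha \, C_L \, V^2 f, \\
P_{\mathrm{dyn}}(s) &= s^{-3} \, P_{\mathrm{dyn}}(1)
    \quad \text{for } f \to f/s,\; V \to V/s, \\
T(s) &= s \, T(1), \\
E(s) &= \bigl( P_{\mathrm{dyn}}(s) + P_{\mathrm{static}} \bigr) \, T(s)
      = s^{-2} \, P_{\mathrm{dyn}}(1) \, T(1) + s \, P_{\mathrm{static}} \, T(1).
\end{align*}
```

Under these assumptions, lowering the frequency reduces the dynamic energy but increases the static energy, so the energy-optimal frequency depends on the ratio of static to dynamic power. For memory-bound code, $T(s)$ typically grows more slowly than $s$, which shifts this trade-off.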

Experimental evaluation

The performance and energy behavior of the program versions derived in Section 2 are investigated in the following experimental evaluation using five multicore processors with different architecture. The evaluation considers the execution time, the power consumption P, and the energy consumption E. Since the energy E is defined as $E = \int_{t=0}^{t_{end}} P(t)\,dt$, assuming that the program is executed from time $t = 0$ to time $t = t_{end}$, the power consumption P(t) at time t may have a strong influence on the
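In practice, power is sampled at discrete points in time, so the energy integral defined above has to be approximated. The following sketch applies the trapezoidal rule to a list of samples. The function name `energy_joules` and the sample values are hypothetical, and the article's measurement procedure is not described in this excerpt.

```python
def energy_joules(timestamps, power_watts):
    """Approximate E = integral of P(t) dt with the trapezoidal rule.

    timestamps:  sample times in seconds, in increasing order
    power_watts: power readings in watts taken at these times
    """
    if len(timestamps) != len(power_watts):
        raise ValueError("timestamps and power_watts must have equal length")
    energy = 0.0
    for k in range(1, len(timestamps)):
        dt = timestamps[k] - timestamps[k - 1]
        energy += 0.5 * (power_watts[k] + power_watts[k - 1]) * dt
    return energy

# Usage: a constant power draw of 50 W over 2 s corresponds to 100 J.
print(energy_joules([0.0, 1.0, 2.0], [50.0, 50.0, 50.0]))  # 100.0
```

Because the energy is the product of average power and execution time, a program version that runs faster but draws more power does not necessarily consume less energy.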

Modeling of power and energy consumption

The observed performance and energy behavior of the different RK versions can be modeled using the power and energy models from Section 3. This will be considered in this section in more detail.

Related work

Runge-Kutta (RK) methods are popular solution methods for ODEs and implementations of RK methods are provided by several numerical libraries. Especially the implementation of DOPRI5 provided by Hairer and Wanner [20] is often used in practice. However, this code is specialized for DOPRI5 with fixed coefficients and does not support parallelism. Sequential implementations of several RK methods including DOPRI5 are also provided by RKSUITE [4], Matlab, and IMSL. Loop transformations to improve

Conclusions

In this article, we have investigated how the performance and energy consumption of parallel ODE solvers can be influenced by applying application-specific program transformations. In particular, we have investigated the behavior of the resulting program versions for different numbers of threads and different operational frequencies used for the execution. The investigation shows that the program transformations can have a significant effect on the performance and energy consumption.

Author statement

The authors have written the paper and have provided an equal contribution to the content.

Conflict of interest

The authors declare no conflict of interest.

Declaration of Competing Interest

The authors report no declarations of interest.

Acknowledgment

This work is supported by the German Ministry of Science and Education (BMBF), project Self-Adaptation of Time-step-based Simulation Techniques on Heterogeneous HPC Systems (SeASiTe) under project number 01IH16012A/B. Moreover, we thank the LRZ Munich for the access to the Xeon system used for the experimental evaluation.

References (48)

  • R.W. Brankin et al. RKSUITE Release 1.0 (1991)

  • K. Burrage. Parallel and Sequential Methods for Ordinary Differential Equations (1995)

  • J.A. Butts et al. A static power model for architects. Proc. of the 33rd Int. Symp. on Microarchitecture (MICRO-33) (2000)

  • S. Catalán et al. Time and energy modeling of a high-performance multi-threaded Cholesky factorization. J. Supercomput. (January 2017)

  • M. Chrobak. Algorithmic aspects of energy-efficient computing

  • T. Dong et al. A step towards energy efficient computing: redesigning a hydrodynamic application on CPU-GPU

  • J. Doweck et al. Inside 6th-generation Intel Core: new microarchitecture code-named Skylake. IEEE Micro (2017)

  • W.H. Enright et al. A Survey of the Explicit Runge-Kutta Method (1995)

  • H. Esmaeilzadeh et al. Power challenges may end the multicore era. Commun. ACM (February 2013)

  • M. Fahad et al. Accurate energy modelling of hybrid parallel applications on modern heterogeneous computing platforms using system-level measurements. IEEE Access (2020)

  • A. Fanfakh et al. Energy consumption reduction for asynchronous message passing applications. J. Supercomput. (2017)

  • A. Haidar et al. The design of fast and energy-efficient linear solvers: on the potential of half-precision arithmetic and iterative refinement techniques. Computational Science – ICCS 2018 (2018)

  • A. Haidar et al. Investigating power capping toward energy-efficient scientific applications. Concurr. Comput.: Pract. Exp. (2019)

  • E. Hairer et al. Solving Ordinary Differential Equations I: Nonstiff Problems (1993)
    Thomas Rauber received his PhD degree in computer science from the University des Saarlandes (Saarbrücken) in 1990. From 1996 to 2002, he was professor for computer science at the Martin-Luther-University Halle-Wittenberg. He joined the University Bayreuth in 2002, where he now holds the chair for parallel and distributed systems. His research interests include parallel and distributed algorithms, programming environments for parallel and distributed systems, compiler optimizations and performance prediction.

    Gudula Rünger received her master's degree and PhD in Mathematics from the University of Cologne and the habilitation in Computer Science from the University of the Saarland. She was a professor for parallel computing and complex systems at the University Leipzig, and since 2000 she has been a full professor for computer science at the Technical University Chemnitz. She has written more than 200 scientific research papers and is a coauthor of the book Parallel Computing published by Springer Verlag.