Modeling the effect of application-specific program transformations on energy and performance improvements of parallel ODE solvers
Introduction
Power and energy awareness are important aspects of the design of modern processors. Several features, such as frequency scaling or power capping, have been developed and integrated into power management units (PMUs) to control the power and energy consumption at the hardware level. Recent processors have several performance states (P-states) and CPU operating states (C-states), which allow the hardware to react to different workloads or to turn off unused components to save power. Dynamic voltage and frequency scaling (DVFS) is supported by most processors to reduce the energy consumption.
Besides the importance of hardware mechanisms to control the energy consumption, software aspects are also of high interest, since the software behavior may have a large influence on the resulting energy consumption and execution time. Thus, it is essential to design software such that the resulting energy consumption meets the given requirements, e.g., be as small as possible. Program transformations may play an important role in this context, since they may lead to program versions with a different performance, power or energy consumption. A crucial step towards designing energy-aware software using program transformations is the ability to assess the energy and performance effects of specific program transformations. A final goal is to quantitatively understand which transformation has which effect on the resulting energy and performance behavior. In this article, we contribute to this goal in the context of ODE solvers. In particular, we apply program transformations to create new program versions and explore the effects of several application-specific transformations on the performance and energy consumption of the new program versions on several recent multicore processors. The transformations applied are difficult to find by a compiler due to the complex program structure with intertwined loops and function calls. We also show that power and energy models can help to get insight into the observed performance and energy behavior.
The solution of ordinary differential equations (ODEs) is an important area in scientific computing and, thus, the performance and energy consumption of ODE solvers are of great interest. In this article, we investigate embedded Runge-Kutta (RK) methods, which are popular ODE solvers for a broad range of application problems, including discretized time-dependent PDE problems, often resulting in large systems of ODEs [20]. Since embedded RK methods are one-step methods that may perform a large number of iteration steps with evaluations of the right-hand side function of the ODE system, it is crucial to design each iteration step such that it is efficient in terms of execution time and energy consumption.
To provide a suitable basis for the investigation of the effect of program transformations, we have developed several implementation versions for embedded RK methods resulting from a series of consecutive application-specific program transformations. The final RK version implements an RK method with delayed function evaluations of the right-hand side of the ODE system. For each solver version, we have developed a multithreaded implementation which exploits the size of the ODE system for execution on multicore processors in each iteration. However, synchronization is needed between iterations due to data dependencies. The interaction of the multithreaded implementations and the loop transformations leads to further requirements for synchronization points within the individual iterations, which are necessary to guarantee numerical correctness. This issue is discussed together with the program transformations.
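To make the structure of one embedded RK time step concrete, the following minimal Python sketch may help. It is not one of the implementation versions studied in this article: it uses the Bogacki-Shampine 3(2) pair as a small stand-in method, and the names `embedded_rk_step` and `solve` as well as the step size controller constants are assumptions chosen for illustration.

```python
import math

# Bogacki-Shampine 3(2) coefficients (stand-in for a generic embedded RK method).
A = [[], [1 / 2], [0.0, 3 / 4], [2 / 9, 1 / 3, 4 / 9]]
C = [0.0, 1 / 2, 3 / 4, 1.0]
B_HIGH = [2 / 9, 1 / 3, 4 / 9, 0.0]    # weights of the order-3 solution
B_LOW = [7 / 24, 1 / 4, 1 / 3, 1 / 8]  # weights of the order-2 solution


def embedded_rk_step(f, t, y, h):
    """Perform one step of size h; return (new solution, error estimate)."""
    n = len(y)
    k = []  # stage vectors, one right-hand side evaluation per stage
    for i in range(len(C)):
        # Argument vector of stage i, built from the previous stage vectors.
        w = [y[j] + h * sum(A[i][l] * k[l][j] for l in range(i)) for j in range(n)]
        k.append(f(t + C[i] * h, w))
    y_high = [y[j] + h * sum(B_HIGH[i] * k[i][j] for i in range(4)) for j in range(n)]
    y_low = [y[j] + h * sum(B_LOW[i] * k[i][j] for i in range(4)) for j in range(n)]
    # The difference of both solutions serves as the local error estimate.
    err = max(abs(a - b) for a, b in zip(y_high, y_low))
    return y_high, err


def solve(f, t0, y0, t_end, h=0.1, tol=1e-6):
    """Integrate from t0 to t_end with simple step size control."""
    t, y, steps = t0, list(y0), 0
    while t_end - t > 1e-14:
        h = min(h, t_end - t)
        y_new, err = embedded_rk_step(f, t, y, h)
        if err <= tol:  # accept the step, otherwise repeat it with smaller h
            t, y, steps = t + h, y_new, steps + 1
        # The error estimate behaves like h**3, hence the exponent 1/3.
        factor = 0.9 * (tol / err) ** (1 / 3) if err > 0 else 2.0
        h *= min(2.0, max(0.2, factor))
    return y, steps


if __name__ == "__main__":
    y, steps = solve(lambda t, y: [-y[0]], 0.0, [1.0], 1.0)
    print(y[0], math.exp(-1.0), steps)
```

In this sketch, the loops over the component index `j` are the natural candidates for distribution across threads in a multithreaded version. Since a right-hand side function may read arbitrary components of the argument vector, such a version would typically need a synchronization point before each stage evaluation.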
The contributions of this work are in the area of application-specific program transformations and the investigation of their effect on the resulting performance, power, and energy consumption of ODE solvers for five different multicore processors. Several influencing factors are analyzed in detail:
- The effect of the computational demands and the memory access characteristics of the specific ODE problem to be solved: Two ODE problems with different characteristics are used as test cases, and an experimental evaluation on different desktop and server processors is performed.
- The effect of the number of threads used to solve the ODE problems: It is shown that the program transformations may have different effects for a varying number of threads. The resulting performance scalability and energy consumption strongly depend on the computational characteristics of the specific ODE problem to be solved.
- The effect of frequency scaling using DVFS on the resulting performance: For different operational frequencies, different performance, power, and energy effects can be observed for the ODE solver versions resulting from the program transformations.
- The effect of the processor architecture: The experimental evaluation is performed on five multicore systems with different numbers of cores. It is shown that different multicore systems may have quite different power consumption, resulting in large differences in the overall energy consumption.
The rest of the paper is structured as follows. Section 2 describes the different multithreaded implementation versions of ODE solution methods. Section 3 introduces the power and energy model used for DVFS. Section 4 contains the experimental evaluation on different hardware systems. Section 5 uses the models from Section 3 for a modeling of execution time and energy consumption. Section 6 discusses related work. Section 7 gives some concluding remarks.
Section snippets
Solution methods for ODEs
The program transformations are applied to the explicit RK method, which is briefly summarized in Section 2.1. The series of transformations is described in Section 2.2 and the resulting multithreaded implementations are given in Section 2.3.
Power and energy model
A power and energy model can help to identify the crucial factors of the power and energy consumption. After giving the terminology in Section 3.1, the modeling of the power and energy consumption using DVFS is briefly described in Section 3.2.
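As a rough illustration of what such a DVFS model can express, the following Python sketch uses a simplified textbook-style model. All of it is an assumption and may differ from the model of Section 3: the frequency is reduced by a scaling factor s >= 1, the dynamic power is assumed to scale with s^-3, the static power is assumed to be independent of s, and the execution time is assumed to grow linearly with s (compute-bound behavior). The function names and parameters are hypothetical.

```python
def energy(s, p_dyn, p_static, t_base):
    """Energy in joules at scaling factor s >= 1 (frequency f_max / s).

    p_dyn, p_static: dynamic and static power in watts at s = 1 (assumed inputs).
    t_base: execution time in seconds at s = 1.
    """
    power = p_dyn / s**3 + p_static  # assumed power scaling
    time = s * t_base                # assumed compute-bound time scaling
    return power * time


def optimal_scaling(p_dyn, p_static):
    """Scaling factor minimizing energy() under the assumptions above.

    Setting dE/ds = (-2 * p_dyn / s**3 + p_static) * t_base to zero
    gives s = (2 * p_dyn / p_static) ** (1/3).
    """
    return (2.0 * p_dyn / p_static) ** (1.0 / 3.0)


if __name__ == "__main__":
    s_opt = optimal_scaling(16.0, 4.0)
    print(s_opt, energy(s_opt, 16.0, 4.0, 1.0))
```

Under these assumptions, the energy-optimal frequency depends only on the ratio of dynamic to static power. For memory-bound programs, whose execution time grows more slowly than s, the optimum would shift towards lower frequencies.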
Experimental evaluation
The performance and energy behavior of the program versions derived in Section 2 is investigated in the following experimental evaluation using five multicore processors with different architectures. The evaluation considers the execution time, the power consumption P, and the energy consumption E. Since the energy E is defined as $E = \int_{0}^{t_{end}} P(t)\,dt$, assuming that the program is executed from time $t = 0$ to time $t = t_{end}$, the power consumption P(t) at time t may have a strong influence on the
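The relation between power and energy stated in this section can be approximated numerically when the power is only known at discrete sampling times. The following Python sketch applies the trapezoidal rule to a hypothetical power trace. The function name `energy_from_samples` and the sample values are assumptions for illustration and do not describe the measurement setup used in the evaluation.

```python
def energy_from_samples(times, powers):
    """Approximate E = integral of P(t) dt with the trapezoidal rule.

    times: sampling times in seconds, strictly increasing.
    powers: power values in watts measured at these times.
    Returns the energy in joules.
    """
    if len(times) != len(powers) or len(times) < 2:
        raise ValueError("need at least two matching samples")
    total = 0.0
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        total += 0.5 * (powers[i] + powers[i + 1]) * dt
    return total


if __name__ == "__main__":
    # Hypothetical trace: constant 50 W over 10 s.
    print(energy_from_samples([0.0, 5.0, 10.0], [50.0, 50.0, 50.0]))
```

For a constant power draw, the result reduces to power times execution time, which is why a version with a shorter execution time can still consume more energy if its power consumption is sufficiently higher.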
Modeling of power and energy consumption
The observed performance and energy behavior of the different RK versions can be modeled using the power and energy models from Section 3. This will be considered in this section in more detail.
Related work
Runge-Kutta (RK) methods are popular solution methods for ODEs and implementations of RK methods are provided by several numerical libraries. Especially the implementation of DOPRI5 provided by Hairer and Wanner [20] is often used in practice. However, this code is specialized for DOPRI5 with fixed coefficients and does not support parallelism. Sequential implementations of several RK methods including DOPRI5 are also provided by RKSUITE [4], Matlab, and IMSL. Loop transformations to improve
Conclusions
In this article, we have investigated how the performance and energy consumption of parallel ODE solvers can be influenced by applying application-specific program transformations. In particular, we have investigated the behavior of the resulting program versions for different numbers of threads and different operational frequencies used for the execution. The investigation shows that the program transformations can have a significant effect on the performance and energy consumption.
Author statement
The authors have written the paper and have provided an equal contribution to the content.
Conflict of interest
The authors declare no conflict of interest.
Declaration of Competing Interest
The authors report no declarations of interest.
Acknowledgment
This work is supported by the German Ministry of Science and Education (BMBF), project Self-Adaptation of Time-step-based Simulation Techniques on Heterogeneous HPC Systems (SeASiTe) under project number 01IH16012A/B. Moreover, we thank the LRZ Munich for the access to the Xeon system used for the experimental evaluation.
References (48)
Parallel methods for initial value problems. Appl. Numer. Math. (1993)
Massive parallelism across space in ODEs. Appl. Numer. Math. (1993)
Energy efficiency vs. performance of the numerical solution of PDEs: an application study on a low-power ARM-based cluster. J. Comput. Phys. (March 2013)
A survey of power and energy efficient techniques for high performance numerical linear algebra operations. Parallel Comput. (2014)
Parallel iteration of high-order Runge-Kutta methods with stepsize control. J. Comput. Appl. Math. (1990)
Parallel Adams methods. J. Comput. Appl. Math. (1999)
Unveiling the performance-energy trade-off in iterative linear system solvers for multithreaded processors. Concurr. Comput.: Pract. Exper. (2015)
PETSc Users Manual. Technical Report ANL-95/11 – Revision 3.8 (2017)
Clock-gating of streaming applications for energy efficient implementations on FPGAs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. (2017)
RKSUITE Release 1.0
Parallel and Sequential Methods for Ordinary Differential Equations
A static power model for architects. Proc. of the 33rd Int. Symp. on Microarchitecture (MICRO-33)
Time and energy modeling of a high-performance multi-threaded Cholesky factorization. J. Supercomput.
Algorithmic aspects of energy-efficient computing
A step towards energy efficient computing: redesigning a hydrodynamic application on CPU-GPU
Inside 6th-generation Intel Core: new microarchitecture code-named Skylake. IEEE Micro
A Survey of the Explicit Runge-Kutta Method
Power challenges may end the multicore era. Commun. ACM
Accurate energy modelling of hybrid parallel applications on modern heterogeneous computing platforms using system-level measurements. IEEE Access
Energy consumption reduction for asynchronous message passing applications. J. Supercomput.
The design of fast and energy-efficient linear solvers: on the potential of half-precision arithmetic and iterative refinement techniques. Computational Science – ICCS 2018
Investigating power capping toward energy-efficient scientific applications. Concurr. Comput.: Pract. Exper.
Solving Ordinary Differential Equations I: Nonstiff Problems
Thomas Rauber received his PhD degree in computer science from the Universität des Saarlandes (Saarbrücken) in 1990. From 1996 to 2002, he was a professor of computer science at the Martin-Luther-University Halle-Wittenberg. He joined the University Bayreuth in 2002, where he now holds the chair for parallel and distributed systems. His research interests include parallel and distributed algorithms, programming environments for parallel and distributed systems, compiler optimizations and performance prediction.
Gudula Rünger received her master's degree and PhD in Mathematics from the University of Cologne and the habilitation in Computer Science from the University of the Saarland. She was a professor for parallel computing and complex systems at the University Leipzig, and since 2000 she has been a full professor of computer science at the Technical University Chemnitz. She has written more than 200 scientific research papers and is a coauthor of the book Parallel Computing published by Springer Verlag.