Mitigating the processor aging through dynamic concurrency throttling

doi:10.1016/j.jpdc.2021.05.006

Journal of Parallel and Distributed Computing

Volume 156, October 2021, Pages 86-100

https://doi.org/10.1016/j.jpdc.2021.05.006

Highlights

• Presents the need for a DCT tool that specifically optimizes the processor aging.
• Hebe: a transparent aging-aware thread throttling approach for OpenMP applications.
• It learns, at run-time, the ideal degree of TLP for each parallel region.
• Hebe outperforms state-of-the-art approaches.
• Very close results from the best solution given by an exhaustive search.

Abstract

The increase in the number of cores in a single chip brings better capabilities to exploit thread-level parallelism (TLP). However, since power dissipated per area rises at each new node generation, higher temperatures are reached, speeding up the aging of hardware components, which may provoke undesired system behavior. Considering that many applications do not scale with the number of cores, executing them with the maximum TLP available will not only degrade performance, but also unnecessarily increase temperature, further accelerating aging. Given that, we propose Hebe, a dynamic concurrency throttling approach that learns at run-time the degree of TLP that reduces the aging for OpenMP applications. Hebe is totally transparent, needing no modifications in the original binary code. With a set of extensive experiments (fifteen benchmarks and four multicore platforms), we show that Hebe outperforms state-of-the-art approaches, with results very close to the best possible solution given by an exhaustive search.

Introduction

To satisfy the demand for performance of applications from many domains in big data centers and cloud-based systems, the number of cores in a single chip package has been growing at the same pace as the increasing transistor density. However, considering that power dissipated per area rises at each new node generation (i.e., the well-known end of Dennard Scaling), heat dissipation has become a significant issue when exploiting thread-level parallelism (TLP). Besides the common problems, like cooling, the increase of heat dissipation raises the operating temperature, which influences some of the main causes of aging of hardware components (e.g., negative bias temperature instability - NBTI), shortening their lifetime.

NBTI is a critical reliability problem in metal-oxide-silicon (MOS) devices. It refers to the generation of positive oxide charge and interface traps in MOS structures under a combination of elevated temperatures and negative gate voltages [62], [11]. This, in turn, increases the threshold voltage ($V_{th}$), which adversely affects current and propagation delay, degrading the device performance [57]. The impact of NBTI on the processor aging has become more significant in modern devices due to the aggressive down-scaling of device geometry and compact device integration. Both strongly affect the operating temperature, intensifying the processor aging [25]. In the end, this increase in the threshold voltage may provoke undesired system behavior (e.g., electromigration, dielectric breakdown, and stress migration [19]) for many critical applications, further increasing the operating expenses. Therefore, controlling the operating temperature is essential to avoid shortening the hardware lifespan.

When running a parallel application, the processor temperature rises as the number of threads increases, mainly because of the increase in power dissipation caused by the switching activity in the hardware components (cores and caches). This behavior can be observed in Fig. 1a for the execution of the BT kernel from the NAS Parallel Benchmark [5] on an Intel Xeon 32-core machine (taken from our experiments, discussed in Section 3). It shows the average CPU power and temperature for the application execution with different numbers of threads (from 2 to 32). As one can observe by comparing Fig. 1a and Fig. 1b, the average NBTI per second of execution (raw numbers obtained from Eq. (1)) grows proportionally with the temperature rise. Therefore, there is a trade-off between the temperature rise, the reduction in total execution time that comes with it, and the impact both have on aging due to NBTI [43], [44]. In the end, they are all directly related to how many threads are distributed across the cores in a parallel application.

Nevertheless, many applications do not scale as the number of threads increases due to several hardware-related reasons: instruction issue width saturation, off-chip bus saturation, data synchronization, and concurrent shared memory accesses [65], [64], [52], [39]. This means that, in many cases, executing a given application in the regular way (i.e., splitting the application into as many threads as possible to use all the available cores) will result in non-optimal use of the available resources: it does not deliver the best trade-off between performance and temperature, and it accelerates the impact of NBTI on the aging process. Therefore, by artificially decreasing the number of threads (i.e., thread throttling) for some parallel regions, one can properly tune the parameters mentioned above to achieve the best performance/temperature ratio and reduce the impact of NBTI on the processor aging. We show that this is possible in Fig. 2 for the execution of the Streamcluster application from the Rodinia benchmark suite [17]: the lowest value of NBTI (Fig. 2b) is achieved when running the application with eight threads, which has the best trade-off between performance and temperature (Fig. 2a). Hence, by executing this application with the ideal degree of TLP (eight threads in this case) instead of the maximum number of threads (which would be the default behavior), the impact of NBTI on the processor aging is 24% lower.
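The selection illustrated by the Streamcluster example can be sketched as an exhaustive comparison over candidate thread counts. The Python sketch below assumes, purely for illustration, that the aging impact of a run can be approximated by its average per-second NBTI rate multiplied by its execution time. The function name `best_thread_count` and every number in `samples` are hypothetical placeholders, not measurements from the paper.

```python
from typing import Dict, Tuple


def best_thread_count(samples: Dict[int, Tuple[float, float]]) -> int:
    """Return the thread count with the lowest approximate aging impact.

    samples maps a thread count to (execution_time_s, avg_aging_rate_per_s).
    The product of the two is an assumed stand-in for the total NBTI impact.
    """
    if not samples:
        raise ValueError("no measurements provided")

    def total_impact(threads: int) -> float:
        exec_time, rate = samples[threads]
        return exec_time * rate

    return min(samples, key=total_impact)


# Hypothetical placeholder values (NOT taken from the paper): execution time
# stops improving past 8 threads while the per-second rate keeps rising.
samples = {
    2: (40.0, 1.0),
    4: (22.0, 1.3),
    8: (14.0, 1.7),
    16: (13.0, 2.4),
    32: (13.5, 3.1),
}
chosen = best_thread_count(samples)
```

With these placeholder values the minimum falls at an intermediate thread count, which mirrors the shape of the trade-off described above: beyond the point where performance saturates, extra threads only add heat.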

Given the scenario above, we propose Hebe, a transparent aging-aware thread throttling approach for OpenMP applications. By using an efficient search algorithm, it automatically learns, at run-time, the ideal number of threads for each parallel region, aiming to reduce the impact of NBTI on the processor aging. This dynamic capability to adapt is key, since thread balancing will vary depending on intrinsic characteristics of the parallel application at hand (e.g., input set and number of parallel regions), as well as the microarchitecture on which it will execute (e.g., number of cores and instruction-set architecture). Hebe improves on previous work [43] by optimizing a more realistic phenomenon of processor aging (i.e., it uses NBTI as the main metric).
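To make the idea of learning a thread count at run-time more concrete, here is a minimal Python sketch of a step-halving hill-climbing search. This introduction does not name the search algorithm Hebe uses, so hill climbing is only an assumed stand-in, and `hill_climb_threads`, `measure`, and the cost function in the usage example are hypothetical. In a real setting, `measure(n)` would correspond to executing one parallel region with `n` threads and returning the observed aging metric.

```python
from typing import Callable, Dict


def hill_climb_threads(measure: Callable[[int], float], max_threads: int) -> int:
    """Search 1..max_threads for a thread count that locally minimizes measure."""
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")

    cache: Dict[int, float] = {}

    def cost(n: int) -> float:
        # Each thread count is measured at most once.
        if n not in cache:
            cache[n] = measure(n)
        return cache[n]

    current = max_threads              # start from the default configuration
    step = max(1, max_threads // 2)
    while step >= 1:
        neighbours = [n for n in (current - step, current + step)
                      if 1 <= n <= max_threads]
        best = min(neighbours, key=cost, default=current)
        if cost(best) < cost(current):
            current = best             # move towards the improvement
        else:
            step //= 2                 # no improvement: refine the step size
    return current


# Hypothetical convex cost with its minimum at 8 threads (illustration only).
found = hill_climb_threads(lambda n: (n - 8) ** 2 + 5, max_threads=32)
```

A search of this kind measures only a handful of thread counts instead of all of them, at the price of possibly stopping at a local minimum when the metric is not unimodal.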

We validate Hebe through the execution of fifteen well-known benchmarks on four distinct multicore platforms (AMD and Intel). We compare Hebe to different scenarios: i) the standard way that parallel applications are executed (STD), that is, with the maximum possible number of threads available; ii) a built-in feature of OpenMP that dynamically changes the number of threads (OMP_Dynamic); iii) a thermal-aware adaptive energy minimization approach for OpenMP applications proposed by Shafik et al. [60] (TA-OMP). We show that, by using Hebe, the impact of NBTI on the processor aging is up to 80% lower than the one presented by the STD configuration; 87% lower than the OMP_Dynamic; and 91% lower than the TA-OMP approach.
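As a rough illustration of how the STD and OMP_Dynamic baselines could be set up, the Python sketch below builds process environments using the standard OpenMP variables `OMP_NUM_THREADS` and `OMP_DYNAMIC`. How the authors actually configured these baselines is not stated here, so this mapping is an assumption; the helper `openmp_env` and the binary path in the comment are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def openmp_env(num_threads: Optional[int] = None,
               dynamic: bool = False,
               base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with OpenMP thread settings applied."""
    env = dict(os.environ if base is None else base)
    if num_threads is not None:
        if num_threads < 1:
            raise ValueError("num_threads must be positive")
        env["OMP_NUM_THREADS"] = str(num_threads)
    # OMP_DYNAMIC lets the runtime adjust the team size of parallel regions.
    env["OMP_DYNAMIC"] = "true" if dynamic else "false"
    return env


# STD (assumed): one thread per logical CPU, dynamic adjustment disabled.
std_env = openmp_env(num_threads=os.cpu_count() or 1)
# OMP_Dynamic (assumed): let the OpenMP runtime pick the team size.
dyn_env = openmp_env(dynamic=True)

# A launch would then look like this (hypothetical binary path):
#   subprocess.run(["./benchmark"], env=std_env, check=True)
```

Configuring the baselines through the environment keeps the application binary untouched, which is the same constraint Hebe itself works under.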

To reinforce the need for a tool that specifically optimizes the impact of NBTI on the processor aging, we also compare the results of Hebe when the objective function is changed to optimize performance, energy, or EDP instead of aging. Fig. 3 presents the NBTI of each configuration normalized to Hebe w.r.t. the geometric mean of the entire benchmark set on each multicore processor. With that, we show that the performance-, energy-, and EDP-oriented setups present an NBTI 12%, 12%, and 10% higher, respectively, than the original aging-oriented objective function. We also compare Hebe to two well-known approaches that target performance and energy rather than aging: the FDT [65] and the Varuna programming model [61]. Finally, to measure the cost of the learning curve for converging to the ideal number of threads brought by its dynamic adaptation, we also compare Hebe to the solution provided by an exhaustive search, which executes each parallel region with its predefined ideal number of threads, without the learning overheads. From this, we found that the average cost of Hebe to reduce the impact of NBTI on the processor aging is less than 1.0% for most of the benchmarks.
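The normalization described for Fig. 3 can be sketched as a geometric mean of per-benchmark ratios. The Python example below is one plausible reading of that computation, not the authors' script; the function names and the input values are hypothetical placeholders.

```python
import math
from typing import Sequence


def geometric_mean(values: Sequence[float]) -> float:
    """Geometric mean of strictly positive values."""
    if not values:
        raise ValueError("values must not be empty")
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized_geomean(config: Sequence[float], baseline: Sequence[float]) -> float:
    """Geometric mean of per-benchmark config/baseline ratios."""
    if len(config) != len(baseline):
        raise ValueError("both sequences need one value per benchmark")
    return geometric_mean([c / b for c, b in zip(config, baseline)])


# Placeholder numbers (not from the paper): two benchmarks whose metric is
# 2x and 8x the baseline give a geometric mean ratio of 4.
ratio = normalized_geomean([2.0, 8.0], [1.0, 1.0])
```

Under this reading, a normalized value of 1.12 for a configuration means its NBTI is 12% higher than the baseline across the benchmark set.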

The remainder of the manuscript is organized as follows. Hebe is described in Section 2. The benchmarks and execution environment used to validate Hebe are described in Section 3. We discuss the results in Section 4. We describe the related work and compare it to Hebe in Section 5, while the final considerations are drawn in Section 6.

Section snippets

Hebe: applying dynamic concurrency throttling to mitigate the processor aging

Hebe aims to reduce the impact of NBTI on the processor aging by tuning the degree of TLP of each parallel region of an OpenMP application. The general workflow of Hebe can be observed in Fig. 4: Given an OpenMP application binary (in this example, it has three parallel regions), Hebe first applies a search algorithm over each parallel region during the learning phase (described in Section 2.2) to find the number of threads that delivers the lowest impact of NBTI on the processor aging. Once

Benchmarks

Fifteen well-known applications written in C/C++ and already parallelized with OpenMP from assorted suites were chosen:

• Six kernels from the NAS Parallel Benchmark [5]: Block Tri-diagonal solver (BT-NAS), Conjugate Gradient (CG-NAS), discrete 3D fast Fourier Transform (FT-NAS), Lower-Upper Gauss-Seidel solver (LU-NAS), Scalar Penta-diagonal solver (SP-NAS), and Unstructured Adaptive mesh (UA-NAS). As the original version of NAS is written in FORTRAN, we use its OpenMP-C version [59].
• Three

Results

In this section, we present the results achieved by Hebe and compare them to the approaches defined in the previous section. Hence, we start by discussing the convergence of the search algorithm implemented by Hebe in Section 4.1. Then, we discuss in Section 4.2 what is the outcome of converging to an ideal or near-ideal number of threads for each parallel region by comparing the results of Hebe to the OpenMP regular execution, the OpenMP dynamic feature, and the thermal-aware approach proposed

Related work

We have split this section into three main parts: techniques to reduce aging at different levels are discussed in Section 5.1; approaches for thread throttling (but that do not target aging) in Section 5.2; and how Hebe advances on and relates to them in Section 5.3.

Concluding remarks

In this manuscript, we have presented Hebe. It is an approach for OpenMP applications that reduces the impact of NBTI on the processor aging by finding the degree of TLP that offers the best trade-off between performance and processor temperature. Hebe optimizes an OpenMP application binary by only setting an environment variable in Linux OS without any code transformation or recompilation. Through an extensive set of experiments, we have shown that Hebe can reduce the impact of NBTI on the

CRediT authorship contribution statement

Thiarles S. Medeiros: Data curation, Methodology, Software, Writing – original draft. Luan Pereira: Investigation, Software, Validation, Writing – original draft. Fábio D. Rossi: Writing – original draft, Writing – review & editing. Marcelo C. Luizelli: Formal analysis, Writing – original draft, Writing – review & editing. Antonio Carlos S. Beck: Conceptualization, Writing – original draft, Writing – review & editing. Arthur F. Lorenzon: Conceptualization, Supervision, Writing – original draft,

Declaration of Competing Interest

There is no conflict of interest.

Thiarles Soares Medeiros is an MSc student in Software Engineering at the Federal University of Pampa (UNIPAMPA), Brazil. Thiarles received his BS in Computer Science from UNIPAMPA in 2012. His areas of interest include the parallelism exploitation in multicore systems and the design of approaches to automate and optimize the thread-level parallelism exploitation.

References (69)

M.J. Abraham et al. Gromacs: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX (2015)
J.H. Stathis et al. The negative bias temperature instability in mos devices: a review. Microelectron. Reliab. (2006)
D. Taborda et al. Application of a hill-climbing technique to the formulation of a new cyclic nonlinear elastic constitutive model. Comput. Geotech. (2012)
F. Ahmed et al. Wearout-aware compiler-directed register assignment for embedded systems
F. Alessi et al. Application-Level Energy Awareness for OpenMP (2015)
V. Aslot et al. SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance (2001)
D.H. Bailey et al. The nas parallel benchmarks; summary and preliminary results
M. Bao et al. On-line thermal aware dynamic voltage scaling for energy optimization with frequency/temperature dependency consideration
A. Bartolini et al. A distributed and self-calibrating model-predictive controller for energy and thermal management of high-performance multicores
S. Bhatt et al. Abstractions for parallel n-body simulations
C. Bienia et al. The parsec benchmark suite: characterization and architectural implications
G. Blake et al. Evolution of thread-level parallelism in desktop applications. Comput. Archit. News (2010)
C. Blat et al. Mechanism of negative-bias-temperature instability. J. Appl. Phys. (1991)
M. Cacciari et al. Thermal and energy management of high-performance multicores: distributed and self-calibrating model-predictive controller. IEEE Trans. Parallel Distrib. Syst. (2013)
G. Chadha et al. When less is more (limo): controlled parallelism for improved efficiency
T. Chantem et al. Online work maximization under a peak temperature constraint
T. Chantem et al. Temperature-aware scheduling and assignment for hard real-time applications on mpsocs. IEEE Trans. Very Large Scale Integr. Syst. (2011)
B. Chapman et al. Using OpenMP: Portable Shared Memory Parallel Programming (2007)
S. Che et al. Rodinia: a benchmark suite for heterogeneous computing
M. Cho et al. Power multiplexing for thermal field management in many-core processors. IEEE Trans. Compon. Packag. Manuf. Technol. (2013)
S. Corbetta et al. Nbti mitigation in microprocessor designs
A.K. Coskun et al. Proactive temperature balancing for low cost thermal management in mpsocs
F. Firouzi et al. Nbti mitigation by optimized nop assignment and insertion
Y. Ge et al. Distributed task migration for thermal management in many-core systems
B. Goel et al. Portable, scalable, per-core power estimation for intelligent resource management
M. Gomaa et al. Heat-and-run: leveraging smt and cmp to manage power density through the operating system
W. Gös. Hole trapping and the negative bias temperature instability (2011)
D. Hackenberg et al. Power measurement techniques on standard compute nodes: a quantitative comparison
V. Hanumaiah et al. Temperature-aware dvfs for hard real-time applications on multicore processors. IEEE Trans. Comput. (2012)
J. Henkel et al. Thermal management for dependable on-chip systems
V. Heuveline et al. The openlb project: an open source and object oriented implementation of lattice Boltzmann methods. Int. J. Mod. Phys. C (2007)
W.-W. Hsieh et al. Thermal-aware post compilation for vliw architectures
Hydrodynamics Challenge Problem, Lawrence Livermore National Laboratory, Tech. Rep....
I. Karlin et al. Lulesh 2.0 updates and changes (August 2013)

Luan Pereira Vargas is a graduate student in Computer Science at the UNIPAMPA, Brazil. His areas of interest include high-performance computing, image processing, computer vision, and data analysis.

Fábio Diniz Rossi is a Lecturer at the Federal Institute of Science, Education and Technology Farroupilha (IFFar, Alegrete, RS, Brazil). He holds a BS degree in Informatics from the University of the Region of Campanha (URCAMP, Brazil, 2000), and MSc (2008) and PhD (2016) degrees in Computer Science from the Pontifical Catholic University of Rio Grande do Sul (PUCRS, Brazil). Dr. Rossi was a visitor at the University of Melbourne, Australia (2014-2015). He is a Cisco Certified Network Associate (CCNA) Instructor. His primary research interests include Fog/Edge/Cloud computing.

Marcelo Caggiani Luizelli holds MSc and PhD degrees in Computer Science from the Federal University of Rio Grande do Sul (UFRGS), obtained under the supervision of Prof. Dr. Luciano Paschoal Gaspary and Prof. Dr. Luciana Salete Buriol. In 2015-2016, Dr. Luizelli visited the Computer Science Department of Technion University and NOKIA Bell Labs (Israel) under the supervision of Prof. Danny Raz. His current research interests are in the field of computer networks, focusing on the design of algorithms and optimization techniques. Recently, he has been working with Network Function Virtualization, Software-Defined Networks, and Programmable Data Planes.

Antonio Carlos Schneider Beck Filho received his PhD degree from the Federal University of Rio Grande do Sul (UFRGS), Brazil, in 2008. Currently, he is an associate professor at the Applied Informatics Department at the Informatics Institute of UFRGS, in charge of Embedded Systems and Computer Organization disciplines at the undergraduate and graduate levels. His primary research interests include computer architectures and embedded systems design, focusing on power consumption. For more information, access www.inf.ufrgs.br/~caco.

Arthur Francisco Lorenzon received his PhD degree from the Federal University of Rio Grande do Sul (UFRGS), Brazil, in 2018. Currently, he is a professor at UNIPAMPA, Campus Alegrete, in charge of Computer Organization and Parallel Computing disciplines at the undergraduate and graduate levels. His areas of interest include the parallelism exploitation in multicore systems, evaluation of different parallel programming interfaces, and the design of approaches to automate and optimize the TLP exploitation.