Mitigating the processor aging through dynamic concurrency throttling
Introduction
To satisfy the demand for performance of applications from many domains in big data centers and cloud-based systems, the number of cores in a single chip package has been growing at the same pace as the increasing transistor density. However, considering that power dissipated per area rises at each new node generation (i.e., the well-known end of Dennard Scaling), heat dissipation has become a significant issue when exploiting thread-level parallelism (TLP). Besides the common problems, like cooling, the increase of heat dissipation raises the operating temperature, which influences some of the main causes of aging of hardware components (e.g., negative bias temperature instability - NBTI), shortening their lifetime.
NBTI is a major reliability problem in metal-oxide-silicon (MOS) devices. It refers to the generation of positive oxide charge and interface traps in MOS structures under a combination of elevated temperatures and negative gate voltages [62], [11]. This, in turn, increases the threshold voltage, which adversely affects current and propagation delay, degrading device performance [57]. The impact of NBTI on processor aging has become more significant in modern devices due to the aggressive down-scaling of device geometry and compact device integration. Both strongly affect the operating temperature, intensifying processor aging [25]. In the end, this increase in the threshold voltage may provoke undesired system behavior (e.g., electromigration, dielectric breakdown, and stress migration [19]) for many critical applications, further increasing the operating expenses. Therefore, controlling the operating temperature is essential to avoid shortening the hardware lifespan.
When running a parallel application, the processor temperature rises as the number of threads increases, mainly as a result of the increase in power dissipation due to the switching activity in the hardware components (cores and caches). This behavior can be observed in Fig. 1a for the execution of the BT kernel from the NAS Parallel Benchmark [5] on an Intel Xeon 32-core machine (taken from our experiments, discussed in Section 3). It shows the average CPU power and temperature for the application execution with different numbers of threads (from 2 to 32). As one can observe by comparing Fig. 1a and Fig. 1b, the average NBTI per second of execution (raw numbers obtained from Eq. (1)) grows proportionally with the temperature rise. Therefore, there is a trade-off between the temperature rise, the reduction in total execution time that more threads bring, and the impact both have on aging due to NBTI [43], [44]. In the end, they are all directly related to how many threads are distributed across the cores in a parallel application.
Nevertheless, many applications do not scale as the number of threads increases due to several hardware-related reasons: instruction issue-width saturation, off-chip bus saturation, data synchronization, and concurrent shared memory accesses [65], [64], [52], [39]. This means that, in many cases, executing a given application in the regular way (i.e., splitting the application into as many threads as possible to use all the available cores) will result in non-optimal use of the available resources: it does not deliver the best trade-off between performance and temperature, and it accelerates the impact of NBTI on the aging process. Therefore, by artificially decreasing the number of threads (i.e., thread throttling) for some parallel regions, one can properly tune the parameters mentioned above to achieve the best performance/temperature ratio and reduce the impact of NBTI on processor aging. We show that this is possible in Fig. 2 for the execution of the Streamcluster application from the Rodinia benchmark suite [17]: the lowest value of NBTI (Fig. 2b) is achieved when running the application with eight threads, which has the best trade-off between performance and temperature (Fig. 2a). Hence, by executing this application with the ideal degree of TLP (eight threads in this case) instead of the maximum number of threads (which would be the default behavior), the impact of NBTI on processor aging is 24% lower.
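To make this trade-off concrete, the sketch below assumes that the NBTI accumulated over a run can be approximated as the average per-second NBTI rate multiplied by the execution time. This is a simplification for illustration only: the function names are hypothetical and every number is invented rather than taken from the measurements in Fig. 2.

```python
# Illustrative only: the figures below are invented, not measurements.
# Assumption: accumulated NBTI ~= average NBTI rate (per second) * execution time.

def accumulated_nbti(rate_per_second, exec_time_s):
    """Approximate the NBTI accumulated over one run of the application."""
    return rate_per_second * exec_time_s


def best_thread_count(profile):
    """Return the thread count with the lowest accumulated NBTI.

    `profile` maps a thread count to (average NBTI rate, execution time in s).
    """
    return min(profile, key=lambda n: accumulated_nbti(*profile[n]))


# Hypothetical application that stops scaling beyond 8 threads:
# execution time flattens while the NBTI rate keeps rising with temperature.
profile = {
    2: (1.0, 100.0),
    4: (1.2, 55.0),
    8: (1.5, 32.0),
    16: (2.0, 30.0),
    32: (2.8, 31.0),
}

ideal = best_thread_count(profile)
print(ideal)
```

With these made-up values, adding threads beyond eight barely shortens the run but keeps raising the per-second rate, so the maximum thread count is not the least damaging choice.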
Given the scenario above, we propose Hebe, a transparent aging-aware thread throttling approach for OpenMP applications. By using an efficient search algorithm, it automatically learns, at run-time, the ideal number of threads for each parallel region, aiming to reduce the impact of NBTI on processor aging. This dynamic capability to adapt is key, since thread balancing will vary depending on intrinsic characteristics of the parallel application at hand (e.g., input set and number of parallel regions), as well as the microarchitecture on which it will execute (e.g., number of cores and instruction-set architecture). Hebe improves on previous work [43] by optimizing a more realistic indicator of processor aging (i.e., it uses NBTI as the main metric).
We validate Hebe through the execution of fifteen well-known benchmarks on four distinct multicore platforms (AMD and Intel). We compare Hebe to different scenarios: i) the standard way that parallel applications are executed (STD), that is, with the maximum possible number of threads available; ii) a built-in feature of OpenMP that dynamically changes the number of threads (OMP_Dynamic); iii) a thermal-aware adaptive energy minimization approach for OpenMP applications proposed by Shafik et al. [60] (TA-OMP). We show that, by using Hebe, the impact of NBTI on processor aging is up to 80% lower than with the STD configuration, 87% lower than with OMP_Dynamic, and 91% lower than with the TA-OMP approach.
To reinforce the need for a tool that specifically optimizes the impact of NBTI on processor aging, we also compare the results of Hebe when the objective function is changed to optimize performance, energy, or EDP instead of aging. Fig. 3 presents the NBTI of each configuration normalized to Hebe, taking the geometric mean over the entire benchmark set on each multicore processor. With that, we show that the performance-, energy-, and EDP-oriented setups present an NBTI 12%, 12%, and 10% higher than the original objective function, respectively. We also compare Hebe to two well-known approaches that target performance and energy rather than aging: FDT [65] and the Varuna programming model [61]. Finally, to measure the cost of the learning phase that the dynamic adaptation needs to converge to the ideal number of threads, we also compare Hebe to the solution provided by an exhaustive search, which executes each parallel region with its predefined ideal number of threads, without the learning overheads. From this, we found that the average cost of Hebe to reduce the impact of NBTI on processor aging is less than 1.0% for most of the benchmarks.
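The normalization used for this kind of comparison can be sketched as follows. The helper names and the sample values are hypothetical and are not the data behind Fig. 3; the sketch only shows how a geometric mean of per-benchmark ratios against a baseline would be computed.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized_to_baseline(config_nbti, baseline_nbti):
    """Geometric mean of the per-benchmark ratios config / baseline.

    Both lists hold one NBTI value per benchmark, in the same order.
    A result of 1.10 means the configuration is 10% worse than the baseline.
    """
    ratios = [c / b for c, b in zip(config_nbti, baseline_nbti)]
    return geometric_mean(ratios)


# Invented NBTI values for three benchmarks (not from the paper).
hebe_nbti = [10.0, 20.0, 40.0]
edp_nbti = [11.0, 22.0, 44.0]

print(round(normalized_to_baseline(edp_nbti, hebe_nbti), 2))
```

The geometric mean of the ratios equals the ratio of the geometric means, so the result does not depend on which of the two is computed first.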
The remainder of the manuscript is organized as follows. Hebe is described in Section 2. The benchmarks and execution environment used to validate Hebe are described in Section 3. We discuss the results in Section 4. We describe the related work and compare it to Hebe in Section 5, while the final considerations are drawn in Section 6.
Section snippets
Hebe: applying dynamic concurrency throttling to mitigate the processor aging
Hebe aims to reduce the impact of NBTI on the processor aging by tuning the degree of TLP of each parallel region of an OpenMP application. The general workflow of Hebe can be observed in Fig. 4: Given an OpenMP application binary (in this example, it has three parallel regions), Hebe first applies a search algorithm over each parallel region during the learning phase (described in Section 2.2) to find the number of threads that delivers the lowest impact of NBTI on the processor aging. Once
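The per-region learning phase can be pictured with the sketch below. It is not Hebe's actual search algorithm, which is described in Section 2.2 and is not reproduced here; a plain hill-climbing search is swapped in for illustration. The `cost` callback is a hypothetical stand-in for measuring the NBTI impact of running one parallel region with a given number of threads, and the three synthetic cost functions are invented.

```python
def hill_climb_threads(cost, max_threads, start=1):
    """Return a thread count in [1, max_threads] that locally minimizes `cost`.

    `cost(n)` is a hypothetical callback standing in for one measured
    execution of a parallel region with `n` threads. Each thread count is
    measured at most once, since a measurement means running the region.
    """
    measured = {}

    def measure(n):
        if n not in measured:
            measured[n] = cost(n)
        return measured[n]

    current = start
    while True:
        neighbours = [n for n in (current - 1, current + 1)
                      if 1 <= n <= max_threads]
        if not neighbours:
            return current
        best = min(neighbours, key=measure)
        if measure(best) >= measure(current):
            return current  # no neighbour improves: local minimum reached
        current = best


# Three synthetic parallel regions with different scaling behavior.
regions = {
    "region_1": lambda n: (n - 8) ** 2 + 5,  # best at 8 threads
    "region_2": lambda n: 100.0 / n,         # keeps improving up to the maximum
    "region_3": lambda n: float(n),          # best run with a single thread
}

ideal = {name: hill_climb_threads(cost, 32) for name, cost in regions.items()}
print(ideal)
```

Each region converges to its own thread count, which is the point of tuning the degree of TLP per parallel region rather than once for the whole application. A step of one thread keeps the sketch simple; it can stop at a local minimum if the cost curve is not unimodal.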
Benchmarks
Fifteen well-known applications written in C/C++ and already parallelized with OpenMP from assorted suites were chosen:
- Six kernels from the NAS Parallel Benchmark [5]: Block Tri-diagonal solver (BT-NAS), Conjugate Gradient (CG-NAS), discrete 3D fast Fourier Transform (FT-NAS), Lower-Upper Gauss-Seidel solver (LU-NAS), Scalar Penta-diagonal solver (SP-NAS), and Unstructured Adaptive mesh (UA-NAS). As the original version of NAS is written in FORTRAN, we use its OpenMP-C version [59].
- Three
Results
In this section, we present the results achieved by Hebe and compare them to the approaches defined in the previous section. Hence, we start by discussing the convergence of the search algorithm implemented by Hebe in Section 4.1. Then, we discuss in Section 4.2 what is the outcome of converging to an ideal or near-ideal number of threads for each parallel region by comparing the results of Hebe to the OpenMP regular execution, the OpenMP dynamic feature, and the thermal-aware approach proposed
Related work
We have split this section into three main parts: techniques to reduce aging at different levels are discussed in Section 5.1; approaches for thread throttling (that do not target aging) in Section 5.2; and how Hebe advances on and relates to them in Section 5.3.
Concluding remarks
In this manuscript, we have presented Hebe, an approach for OpenMP applications that reduces the impact of NBTI on processor aging by finding the degree of TLP that offers the best trade-off between performance and processor temperature. Hebe optimizes an OpenMP application binary by only setting an environment variable in the Linux OS, without any code transformation or recompilation. Through an extensive set of experiments, we have shown that Hebe can reduce the impact of NBTI on the
CRediT authorship contribution statement
Thiarles S. Medeiros: Data curation, Methodology, Software, Writing – original draft. Luan Pereira: Investigation, Software, Validation, Writing – original draft. Fábio D. Rossi: Writing – original draft, Writing – review & editing. Marcelo C. Luizelli: Formal analysis, Writing – original draft, Writing – review & editing. Antonio Carlos S. Beck: Conceptualization, Writing – original draft, Writing – review & editing. Arthur F. Lorenzon: Conceptualization, Supervision, Writing – original draft,
Declaration of Competing Interest
There is no conflict of interest.
References (69)
- GROMACS: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX (2015)
- The negative bias temperature instability in MOS devices: a review. Microelectron. Reliab. (2006)
- Application of a hill-climbing technique to the formulation of a new cyclic nonlinear elastic constitutive model. Comput. Geotech. (2012)
- Wearout-aware compiler-directed register assignment for embedded systems
- Application-level energy awareness for OpenMP (2015)
- SPEComp: a new benchmark suite for measuring parallel computer performance (2001)
- The NAS parallel benchmarks; summary and preliminary results
- On-line thermal aware dynamic voltage scaling for energy optimization with frequency/temperature dependency consideration
- A distributed and self-calibrating model-predictive controller for energy and thermal management of high-performance multicores
- Abstractions for parallel n-body simulations
- The PARSEC benchmark suite: characterization and architectural implications
- Evolution of thread-level parallelism in desktop applications. Comput. Archit. News
- Mechanism of negative-bias-temperature instability. J. Appl. Phys.
- Thermal and energy management of high-performance multicores: distributed and self-calibrating model-predictive controller. IEEE Trans. Parallel Distrib. Syst.
- When less is more (LIMO): controlled parallelism for improved efficiency
- Online work maximization under a peak temperature constraint
- Temperature-aware scheduling and assignment for hard real-time applications on MPSoCs. IEEE Trans. Very Large Scale Integr. Syst.
- Using OpenMP: portable shared memory parallel programming
- Rodinia: a benchmark suite for heterogeneous computing
- Power multiplexing for thermal field management in many-core processors. IEEE Trans. Compon. Packag. Manuf. Technol.
- NBTI mitigation in microprocessor designs
- Proactive temperature balancing for low cost thermal management in MPSoCs
- NBTI mitigation by optimized NOP assignment and insertion
- Distributed task migration for thermal management in many-core systems
- Portable, scalable, per-core power estimation for intelligent resource management
- Heat-and-run: leveraging SMT and CMP to manage power density through the operating system
- Hole trapping and the negative bias temperature instability
- Power measurement techniques on standard compute nodes: a quantitative comparison
- Temperature-aware DVFS for hard real-time applications on multicore processors. IEEE Trans. Comput.
- Thermal management for dependable on-chip systems
- The OpenLB project: an open source and object oriented implementation of lattice Boltzmann methods. Int. J. Mod. Phys. C
- Thermal-aware post compilation for VLIW architectures
- LULESH 2.0 updates and changes
Cited by (5)
- Seamless Thermal Optimization of Parallel Workloads. 2023, IEEE Design and Test
- Harnessing the Effects of Process Variability to Mitigate Aging in Cloud Servers. 2023, Proceedings of IEEE Computer Society Annual Symposium on VLSI, ISVLSI
- Machine learning based workload balancing scheme for minimizing stress migration induced aging in multicore processors. 2023, International Journal of Information Technology (Singapore)
- Thermal-Aware Thread and Turbo Frequency Throttling Optimization for Parallel Applications. 2022, 35th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design, SBCCI 2022 - Proceedings
Thiarles Soares Medeiros is an MSc student in Software Engineering at the Federal University of Pampa (UNIPAMPA), Brazil. Thiarles received his BS in Computer Science from UNIPAMPA in 2012. His areas of interest include parallelism exploitation in multicore systems and the design of approaches to automate and optimize thread-level parallelism exploitation.
Luan Pereira Vargas is a graduate student in Computer Science at the UNIPAMPA, Brazil. His areas of interest include high-performance computing, image processing, computer vision, and data analysis.
Fábio Diniz Rossi is a Lecturer at the Federal Institute of Science, Education and Technology Farroupilha (IFFar, Alegrete, RS, Brazil). He holds a BS degree in Informatics from the University of the Region of Campanha (URCAMP, Brazil, 2000), and MSc (2008) and PhD (2016) degrees in Computer Science from the Pontifical Catholic University of Rio Grande do Sul (PUCRS, Brazil). Dr. Rossi was a visitor at the University of Melbourne, Australia (2014-2015). He is a Cisco Certified Network Associate (CCNA) Instructor. His primary research interests include Fog/Edge/Cloud computing.
Marcelo Caggiani Luizelli holds MSc and PhD degrees in Computer Science from the Federal University of Rio Grande do Sul (UFRGS), obtained under the supervision of Prof. Dr. Luciano Paschoal Gaspary and Prof. Dr. Luciana Salete Buriol. In 2015-2016, Dr. Luizelli visited the Computer Science Department of Technion University and NOKIA Bell Labs (Israel) under the supervision of Prof. Danny Raz. His current research interests are in the field of computer networks, focusing on the design of algorithms and optimization techniques. Recently, he has been working with Network Function Virtualization, Software-Defined Networks, and Programmable Data Planes.
Antonio Carlos Schneider Beck Filho received his PhD degree from the Federal University of Rio Grande do Sul (UFRGS), Brazil, in 2008. Currently, he is an associate professor at the Applied Informatics Department at the Informatics Institute of UFRGS, in charge of Embedded Systems and Computer Organization disciplines at the undergraduate and graduate levels. His primary research interests include computer architectures and embedded systems design, focusing on power consumption. For more information, access www.inf.ufrgs.br/~caco.
Arthur Francisco Lorenzon received his PhD degree from the Federal University of Rio Grande do Sul (UFRGS), Brazil, in 2018. Currently, he is a professor at UNIPAMPA, Campus Alegrete, in charge of Computer Organization and Parallel Computing disciplines at the undergraduate and graduate levels. His areas of interest include the parallelism exploitation in multicore systems, evaluation of different parallel programming interfaces, and the design of approaches to automate and optimize the TLP exploitation.