PARIS: Predicting application resilience using machine learning

https://doi.org/10.1016/j.jpdc.2021.02.015

Highlights

  • PARIS presents a machine-learning method to predict HPC application resilience.

  • PARIS captures the relationship between application characteristics and resilience.

  • PARIS avoids random fault injection and provides high prediction accuracy.

  • PARIS provides a foundation for resilience modeling using machine learning.

Abstract

The traditional method to study application resilience to errors in HPC applications uses fault injection (FI), a time-consuming approach. While analytical models have been built to overcome the inefficiencies of FI, they lack accuracy. In this paper, we present PARIS, a machine-learning method to predict application resilience that avoids the time-consuming process of random FI and provides higher prediction accuracy than analytical models. PARIS captures the implicit relationship between application characteristics and application resilience, which is difficult to capture using most analytical models. We overcome many technical challenges for feature construction, extraction, and selection to use machine learning in our prediction approach. Our evaluation on 16 HPC benchmarks shows that PARIS achieves high prediction accuracy. PARIS is up to 450x faster than random FI (49x on average). Compared to the state-of-the-art analytical model, PARIS is at least 63% better in terms of accuracy and has comparable execution time on average.

Introduction

As high performance computing (HPC) systems grow in scale, they become more susceptible to transient faults [6] due to shrinking feature sizes, lower voltages, and increasing densities in hardware infrastructures [59]. As a result, scientific applications running at extreme scales apply different resilience methods to tolerate frequent soft errors. Applying these methods to a given application often requires a deep understanding of that application's resilience.

The common practice for studying application resilience to errors in HPC systems is Fault Injection (FI) [11], [17], [62], [63], [64]. This approach performs a large number of random injections, each of which randomly selects an instruction and then triggers bit flips in the instruction's input or output operands during application execution. Statistical results are then used to quantify application resilience.

While FI works in practice and is widely used in resilience studies, a key problem with this approach is that it is highly time consuming. To illustrate the problem, consider an application that runs for 6 hours (a common execution time for a large-scale scientific simulation [26]). Using statistical analysis (e.g., [45]), the number of random injections needed to obtain a low margin of error (e.g., 1%–3%) is on the order of thousands. Thus, the total FI campaign could last several days. For multi-threaded or multi-process applications, this time is significantly higher, since random faults must be injected into different threads or processes.
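The "thousands of injections" figure above follows from the standard sample-size calculation for statistical fault injection. A minimal sketch (the exact formula used in [45] accounts for a finite population; the infinite-population version below is the simpler, slightly conservative form):

```python
import math

def fi_sample_size(margin_of_error, confidence_z=1.96, p=0.5):
    """Number of random fault injections needed for a given margin of
    error at a given confidence level, using the standard sample-size
    formula n = z^2 * p * (1 - p) / e^2 (p = 0.5 is the worst case)."""
    return math.ceil(confidence_z**2 * p * (1 - p) / margin_of_error**2)

# A 3% margin at 95% confidence already requires over a thousand
# injections; tightening to 1% pushes the campaign near ten thousand.
print(fi_sample_size(0.03))  # 1068
print(fi_sample_size(0.01))  # 9604
```

At 1,068 injections into a 6-hour application, a serial campaign would run for roughly 267 days of compute time, which is why random FI is impractical at this scale.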

To address the limitations of FI, researchers have built error-propagation analytical models [46], which are faster than FI at estimating application resilience. However, they lack accuracy, because they estimate application resilience based on an analysis of possible errors in individual instructions: inaccuracies at individual instructions accumulate, yielding low accuracy for the application as a whole. Furthermore, these models do not consider the effects of resilience computation patterns (e.g., dead corrupted locations and repeated addition [32]). Studying those patterns demands analyzing multiple instructions together, while most existing analytical models analyze instructions in isolation.

In summary, the community lacks a fundamental approach that enables fast and accurate evaluation of application resilience. In this paper, we present a novel framework called PARIS,1 which avoids the time-consuming process of randomly selecting and executing many injections (as in FI) and provides higher prediction accuracy than analytical models, making it a unique solution to the problem. In essence, PARIS uses a machine learning model to predict application resilience, which provides several advantages. First, once trained, the model can be reused for any fault manifestation – silent data corruption (SDC), interruptions, and success cases – on new, previously unseen applications. Therefore, PARIS avoids a large number of repeated fault injection tests, which makes it far more efficient than FI. Second, machine learning models can capture the implicit relationship between application characteristics (e.g., the intensity of resilience computation patterns) and application resilience, which is difficult to capture with analytical models.

The most challenging part of using the machine learning approach is to build effective features. We use the following methods to construct a feature vector of 30 features.

First, we count the number of instruction instances within each instruction type as a feature; instruction instances are dynamic executions of static instructions. We characterize instructions this way because different instruction types show different resilience to errors [12], [37]. To reduce the number of features, we classify instruction types into four representative and discriminative groups based on instruction functionality. This reduction of features lowers training complexity and avoids undertraining.
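This per-group counting can be sketched as follows. The opcode names and the four group labels here are illustrative stand-ins; the paper's actual grouping of instruction types is defined by functionality and is not enumerated in this excerpt:

```python
from collections import Counter

# Illustrative grouping of LLVM-style opcodes into four coarse classes;
# the paper's real four groups are chosen for their discriminative power.
OPCODE_GROUP = {
    "load": "memory", "store": "memory",
    "add": "arithmetic", "fadd": "arithmetic", "mul": "arithmetic",
    "br": "control", "icmp": "control",
    "and": "logic", "shl": "logic",
}

def instruction_type_features(trace):
    """Count dynamic instruction instances per group from an opcode trace."""
    counts = Counter(OPCODE_GROUP.get(op, "other") for op in trace)
    return {g: counts[g] for g in ("memory", "arithmetic", "control", "logic")}

trace = ["load", "add", "add", "store", "icmp", "br"]
print(instruction_type_features(trace))
# {'memory': 2, 'arithmetic': 2, 'control': 2, 'logic': 0}
```

Grouping opcodes rather than keeping one feature per opcode is what keeps the feature count below the number of training applications, as required to avoid undertraining.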

Second, we count resilience computation patterns as features. Guo et al. [32] discovered six resilience computation patterns in HPC applications; those patterns are considered the fundamental reason for application resilience. Four of the patterns are based on individual instructions and can be included as features using the instruction type-based approach above. The remaining two ("dead locations" and "repeated addition") involve more than one instruction and cannot be captured by examining instructions individually. To count these two patterns efficiently, we introduce optimization techniques that avoid repeatedly scanning the instruction trace to find correlations between instructions.
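As a rough illustration of a single-pass count, consider the "dead location" pattern: a value written to a location is overwritten before it is ever read, so a fault in it cannot propagate. The trace encoding and the pattern definition below are simplified assumptions for illustration, not the paper's actual trace format or algorithm:

```python
def count_dead_locations(trace):
    """Count 'dead location' instances in a single pass over a trace of
    ("load"/"store", address) tuples: a store whose value is overwritten
    by a later store to the same address with no intervening load."""
    unread_store = {}  # address -> True while its latest store is unread
    dead = 0
    for op, addr in trace:
        if op == "store":
            if unread_store.get(addr):  # prior store was never read: dead
                dead += 1
            unread_store[addr] = True
        else:  # a load consumes the pending store at this address
            unread_store[addr] = False
    return dead

trace = [("store", 0x10), ("store", 0x10),              # first store dies
         ("store", 0x20), ("load", 0x20), ("store", 0x20)]
print(count_dead_locations(trace))  # 1
```

The point of the single-pass bookkeeping is exactly the optimization mentioned above: the trace is never re-scanned to correlate each store with later accesses.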

Third, we introduce instruction execution order information into the features to improve modeling accuracy. Execution order matters to application resilience because error propagation is highly correlated with the order and type of operations. Inspired by the "N-gram" technique [16], [56] from computational linguistics, we embed the sequence of instruction chunks into features to capture the execution order of instructions. Our evaluation shows that including execution order information decreases prediction error by up to 30%.
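The chunk-sequence idea can be sketched like this: cut the dynamic trace into fixed-size chunks and count adjacent chunk pairs (bigrams), so that the same instruction mix in a different order yields different features. The chunk size and this exact encoding are illustrative choices, not the paper's specification:

```python
from collections import Counter

def chunk_ngram_features(trace, chunk_size=2):
    """Bigram counts over fixed-size instruction chunks. Adjacent chunk
    pairs become features, encoding execution order in the spirit of
    N-grams in computational linguistics."""
    chunks = [tuple(trace[i:i + chunk_size])
              for i in range(0, len(trace), chunk_size)]
    return Counter(zip(chunks, chunks[1:]))

trace = ["load", "add", "store", "add", "load", "add"]
feats = chunk_ngram_features(trace)
# Two ordered chunk pairs: (load,add)->(store,add) and (store,add)->(load,add)
```

A plain opcode histogram would treat any permutation of this trace identically; the bigram features distinguish them, which is precisely the order information the model needs.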

Fourth, we introduce a resilience weight when counting instruction instances. Different instruction instances, even of the same instruction type, can have different capabilities to tolerate faults; the resilience weight quantifies this difference. Introducing the resilience weight decreases prediction error by 13% on average when predicting the rate of a particular fault manifestation (the interruption rate).
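The effect on the feature vector is that a group's feature becomes a weighted sum rather than a raw count. The per-instance weights below are made-up numbers for illustration; how PARIS derives each instance's weight is not described in this excerpt:

```python
def weighted_instance_count(instances):
    """Weighted count of instruction instances: each (opcode, weight)
    pair contributes its resilience weight instead of 1, so instances
    that mask faults more often contribute more to the feature."""
    return sum(weight for _opcode, weight in instances)

# Three 'add' instances with different (hypothetical) resilience weights:
adds = [("add", 0.9), ("add", 0.3), ("add", 0.6)]
print(round(weighted_instance_count(adds), 1))  # 1.8, vs. a raw count of 3
```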

Based on the above features, we use feature selection techniques to rank and further reduce the features. We perform an ablation study to understand the sensitivity of prediction accuracy to each feature. We reveal the significance of memory-related instructions and data overwriting to application resilience.
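As a minimal stand-in for this selection step, one can rank features by their correlation with the resilience target and keep the top few. This is only a sketch of the general idea; the paper's actual selection techniques (and its ablation protocol) are not specified in this excerpt:

```python
import numpy as np

def rank_features(X, y):
    """Rank feature columns of X by absolute Pearson correlation with
    the target y (e.g., a fault-manifestation rate), highest first."""
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(range(X.shape[1]), key=lambda j: corrs[j], reverse=True)

# Synthetic example: 100 'applications', 3 features, target driven by
# feature 1 only, so it should rank first.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)
print(rank_features(X, y)[0])  # 1
```

An ablation study then re-trains the model with each (group of) feature(s) removed and compares prediction error, which is how the significance of memory-related instructions was established.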

In summary, our contributions are as follows.

  • We present PARIS, a machine learning-based approach to predict application resilience. Our method breaks the fundamental tradeoff between evaluation speed and accuracy in the existing common practice to estimate application resilience.

  • We develop a framework and overcome a series of technical challenges for feature construction, extraction and selection. We reveal how to use machine learning to effectively and efficiently model application resilience.

  • We test our model on 16 benchmarks. We find that our approach is up to 450x faster than random FI (49x on average). The model has high prediction accuracy: average prediction errors of 8.5% for success rate and 22% for interruption rate (excluding two obvious outliers). We compare PARIS with Trident [46] (the state-of-the-art analytical model): PARIS can predict any fault manifestation rate (SDC, interruptions, and success), while Trident only predicts SDC rate; PARIS is at least 63% better than Trident in terms of accuracy for predicting SDC rate, and has comparable execution time (it is faster for 12 of the 16 benchmarks, with a 15x speedup on average).

Section snippets

Fault model

We consider transient faults in the computation units of processors, for example, faults in the Arithmetic Logic Unit (ALU) and in the address computation for loads and stores. We do not consider transient faults in memory components, such as caches, because these components are usually protected by Error Correcting Code (ECC) or parity at the architecture level. Similar assumptions are made in existing work [46], [63].

Furthermore, we consider a single bit-flip model, not multiple bit-flip

Overview

We give a high-level overview of PARIS. Fig. 1 depicts the workflow of the training process of PARIS. The most challenging part of the training process is to construct features relevant to application resilience that can produce high modeling accuracy.

Feature Construction. We use the instruction type and the number of instruction instances of each type as features. A static instruction in a program has an instruction type (opcode) and can be executed many times, each of which is an instruction

Feature construction

For feature construction, we have the following requirements: (1) features should be relevant to application resilience; (2) the number of features should be small enough (smaller than the number of applications used for training) to avoid under-determination of the model; (3) we should avoid redundant and irrelevant features since these features can increase prediction error. Following the above requirements, we introduce instructions, resilience computation patterns, resilience weight, and

Implementation

Dataset Construction. We have multiple requirements for creating the training and testing datasets. (1) The training dataset must be large enough to avoid model underdetermination; (2) the applications used to generate the training and testing datasets must have diverse computation and resilience characteristics; (3) those applications must have explicit result verification phases. Having a verification phase allows us to determine the fault manifestations.

Testing

Evaluation

We evaluate our model and modeling methods from two perspectives: (1) modeling accuracy; (2) contributions of features and optimization techniques to modeling accuracy.

To evaluate modeling accuracy, we calculate MAPE for the predicted success, SDC, and interruption rates compared with the ground-truth rates measured by performing FI campaigns. We use PINFI [63] for FI. For each program in our dataset, we perform an FI campaign of 3000 random fault injections, following the statistical
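MAPE (mean absolute percentage error), the accuracy metric named above, is straightforward to compute against the FI-measured ground truth; the rates in the example are invented for illustration:

```python
def mape(predicted, ground_truth):
    """Mean absolute percentage error between predicted rates and
    FI-measured ground-truth rates (cf. De Myttenaere et al., cited in
    the references)."""
    return 100.0 * sum(abs(p - g) / abs(g)
                       for p, g in zip(predicted, ground_truth)) / len(ground_truth)

# e.g. predicted success rates vs. FI ground truth for three programs:
print(round(mape([0.50, 0.40, 0.90], [0.50, 0.50, 1.00]), 1))  # 10.0
```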

Discussions

Use of PARIS. To use PARIS, the user only needs to train the prediction model once, and then the trained model can be repeatedly used for predicting error resilience of any application. Predicting application resilience is useful for improving application resilience [14], [32] and optimizing fault tolerance mechanisms [20], [39], [46], [65]. To train the prediction model, the user must follow the training workflow in Fig. 1. Given a new application, the user needs to generate a dynamic

Related work

Using Machine Learning to Address Resilience Problems. Recent research has started to use ML to address resilience problems [3], [20], [28], [30], [39], [44], [52]. Laguna et al. [44] train an ML classifier, IPAS, which learns which instructions have a high likelihood of leading to silent output corruption. Desh [20] predicts node failures by training a recurrent neural network model on system logs. Nie et al. [52] use system logs to predict the future occurrence of GPU errors. PRISM [39]

Conclusions

Understanding application resilience to errors is increasingly important for ensuring result correctness in HPC applications. The traditional method (FI) for understanding application resilience is too expensive; analytical models are faster but not as accurate as FI. This paper introduces PARIS, a new ML-based solution to these problems. We discuss feature construction, extraction, and selection, which are the keys to enabling high-performance ML for predicting application

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Karthik Pattabiraman, University of British Columbia

Acknowledgments

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-766363). This research was partially supported by U.S. National Science Foundation (CNS-1617967, CCF-1553645 and CCF-1718194).

Dr. Luanzheng Guo is a postdoctoral researcher at the Pacific Northwest National Laboratory, working with the HPC Group at the intersection of HPC and machine learning. He obtained his Ph.D. in Electrical Engineering and Computer Science from the University of California, Merced in 2020. His Ph.D. research focused on system resilience and reliability in large-scale parallel HPC systems.

References (67)

  • De Myttenaere, A., et al., Mean absolute percentage error for regression models, Neurocomputing, 2016.
  • Coral Benchmark Codes [online],...
  • Aktulga, H.M., et al., Parallel reactive molecular dynamics: Numerical methods and algorithmic techniques, Parallel Comput., 2012.
  • Ashraf, R., et al., Understanding the propagation of transient errors in HPC applications, SC, 2015.
  • Bailey, D.H., et al., NAS Parallel benchmark results, SC, 1992.
  • Battiti, R., Using mutual information for selecting features in supervised neural net learning, IEEE Trans. Neural Netw., 1994.
  • Baumann, R.C., Radiation-induced soft errors in advanced semiconductor technologies, IEEE Trans. Device Mater. Reliab., 2005.
  • Bergstra, J., et al., Random search for hyper-parameter optimization, JMLR, 2012.
  • Bienia, C., et al., The PARSEC benchmark suite: Characterization and architectural implications, PACT, 2008.
  • Bradley, P.S., et al., Feature selection via concave minimization and support vector machines, ICML, 1998.
  • Cai, L., et al., Probabilistic wind power forecasting approach via instance-based transfer learning embedded gradient boosting decision trees, Energies, 2019.
  • Calhoun, J., Olson, L., Snir, M., FlipIt: An LLVM based fault injector for HPC, in: Workshops in Euro-Par,...
  • Calhoun, J., et al., Towards a more complete understanding of SDC propagation, HPDC, 2017.
  • Cappello, F., Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities, Int. J. High Perform. Comput. Appl., 2009.
  • Casas, M., et al., Fault resilience of the multi-grid solver, ICS, 2012.
  • Che, S., et al., Rodinia: A benchmark suite for heterogeneous computing, IISWC, 2009.
  • Chen, X., et al., Gated recursive neural network for Chinese word segmentation, ACL, 2015.
  • Cher, C.-Y., et al., Understanding soft error resiliency of BlueGene/Q compute chip through hardware proton irradiation and software fault injection, SC, 2014.
  • Choubin, B., et al., Multiple linear regression, multi-layer perceptron network and adaptive neuro-fuzzy inference system for forecasting precipitation based on large-scale climate signals, Hydrol. Sci. J., 2016.
  • Coates, A., et al., An analysis of single-layer networks in unsupervised feature learning, AISTATS, 2011.
  • Das, A., et al., Desh: Deep learning for system health prediction of lead times to failure in HPC, HPDC, 2018.
  • De Oliveira, D.A.G., et al., Radiation-induced error criticality in modern HPC parallel accelerators, HPCA, 2017.
  • Deng, J., et al., ImageNet: A large-scale hierarchical image database.
  • Domingos, P., Bayesian averaging of classifiers and the overfitting problem, ICML, 2000.
  • Drucker, H., et al., Boosting and other ensemble methods, Neural Comput., 1994.
  • Egwutuoha, I.P., et al., A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, J. Supercomput., 2013.
  • Friedman, J.H., Stochastic gradient boosting, Comput. Stat. Data Anal., 2002.
  • Georgakoudis, G., et al., Reinit++: Evaluating the performance of global-restart recovery methods for MPI fault tolerance, ISC, 2020.
  • Georgakoudis, G., et al., REFINE: Realistic fault injection via compiler-based instrumentation for accuracy, portability and speed, SC, 2017.
  • Guo, L., et al., MATCH: An MPI fault tolerance benchmark suite.
  • Guo, L., Li, D., MOARD: Modeling application resilience to transient faults on data objects, in: International Parallel...
  • Guo, L., et al., FlipTracker: Understanding natural error resilience in HPC applications, SC, 2018.
  • Guyon, I., et al., An introduction to variable and feature selection, JMLR, 2003.
Dr. Dong Li is an associate professor in the Department of Electrical Engineering and Computer Science, University of California, Merced. He is the director of the Parallel Architecture, System, and Algorithm Lab (PASA).

Dr. Ignacio Laguna is a Computer Scientist at the Center for Applied Scientific Computing (CASC) at the Lawrence Livermore National Laboratory (LLNL), California. His main area of research is high-performance computing (HPC); his main sub-area of research in HPC is programming models and systems. He is in particular interested in fault tolerance, fault resilience, debugging, software correctness and general software reliability.
