PARIS: Predicting application resilience using machine learning

https://doi.org/10.1016/j.jpdc.2021.02.015

Highlights

  • PARIS presents a machine-learning method to predict HPC application resilience.

  • PARIS captures the relationship between application characteristics and resilience.

  • PARIS avoids random fault injection and provides high prediction accuracy.

  • PARIS provides a foundation for resilience modeling using machine learning.

Abstract

The traditional method to study application resilience to errors in HPC applications uses fault injection (FI), a time-consuming approach. While analytical models have been built to overcome the inefficiencies of FI, they lack accuracy. In this paper, we present PARIS, a machine-learning method to predict application resilience that avoids the time-consuming process of random FI and provides higher prediction accuracy than analytical models. PARIS captures the implicit relationship between application characteristics and application resilience, which is difficult to capture using most analytical models. We overcome many technical challenges for feature construction, extraction, and selection to use machine learning in our prediction approach. Our evaluation on 16 HPC benchmarks shows that PARIS achieves high prediction accuracy. PARIS is up to 450x faster than random FI (49x on average). Compared to the state-of-the-art analytical model, PARIS is at least 63% better in terms of accuracy and has comparable execution time on average.

Introduction

As high performance computing (HPC) systems grow in scale, they become more susceptible to transient faults [6] due to shrinking feature sizes, lower voltages, and increasing densities in hardware infrastructures [59]. As a result, scientific applications running at extreme scales apply different resilience methods to tolerate frequent soft errors. Applying these methods to a given application often requires a deep understanding of that application's resilience.

The common practice for studying application resilience to errors in HPC systems is Fault Injection (FI) [11], [17], [62], [63], [64]. This approach performs a large number of random injections, each of which randomly selects an instruction and then triggers bit flips in the instruction's input or output operands during application execution. Statistical results are then used to quantify application resilience.

While FI works in practice and is widely used in resilience studies, a key problem with this approach is that it is highly time consuming. To illustrate the problem, consider an application that runs for 6 hours (a common execution time for a large-scale scientific simulation [26]). Using statistical analysis (e.g., [45]), the number of random injections needed to obtain a low margin of error (e.g., 1%–3%) is on the order of thousands. Thus, the total FI campaign could last several days. For multi-threaded or multi-process applications, this time is significantly higher, since random faults must be injected into different threads or processes.
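The "thousands of injections" figure above follows from the standard sample-size calculation for statistical fault injection. A minimal sketch (the exact formula used in [45] accounts for a finite population; the infinite-population version below is the simpler, slightly conservative form):

```python
import math

def fi_sample_size(margin_of_error, confidence_z=1.96, p=0.5):
    """Number of random fault injections needed for a given margin of
    error at a given confidence level, using the standard sample-size
    formula n = z^2 * p * (1 - p) / e^2 (p = 0.5 is the worst case)."""
    return math.ceil(confidence_z**2 * p * (1 - p) / margin_of_error**2)

# A 3% margin at 95% confidence already requires over a thousand
# injections; tightening to 1% pushes the campaign near ten thousand.
print(fi_sample_size(0.03))  # 1068
print(fi_sample_size(0.01))  # 9604
```

At 1,068 injections into a 6-hour application, a serial campaign would run for roughly 267 days of compute time, which is why random FI is impractical at this scale.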

To address the limitations of FI, researchers have built error-propagation analytical models [46], which are faster than FI at estimating application resilience. However, they lack accuracy, because they estimate application resilience based on an analysis of possible errors in individual instructions: inaccuracies at individual instructions accumulate, yielding low accuracy for the application as a whole. Furthermore, these models do not consider the effects of resilience computation patterns (e.g., dead corrupted locations and repeated addition [32]). Studying those patterns demands analyzing multiple instructions together, while most existing analytical models analyze instructions in isolation.

In summary, the community lacks a fundamental approach that enables fast and accurate evaluation of application resilience. In this paper, we present a novel framework called PARIS,1 which avoids the time-consuming process of randomly selecting and executing many injections (as in FI) and provides higher prediction accuracy than analytical models, making it a unique solution to the problem. In essence, PARIS uses a machine learning model to predict application resilience, which provides several advantages. First, once trained, the model can be reused for any fault manifestation – silent data corruption (SDC), interruptions, and success cases – on new, previously unseen applications. Therefore, PARIS avoids a large number of repeated fault injection tests, which makes it far more efficient than FI. Second, machine learning models can capture the implicit relationship between application characteristics (e.g., the intensity of resilience computation patterns) and application resilience, which is difficult to capture with analytical models.

The most challenging part of using the machine learning approach is to build effective features. We use the following methods to construct a feature vector of 30 features.

First, we count the number of instruction instances within each instruction type as a feature; instruction instances are dynamic executions of static instructions. We characterize instructions this way because different instruction types show different resilience to errors [12], [37]. To reduce the number of features, we classify instruction types into four representative and discriminative groups based on instruction functionality. This reduction of features lowers training complexity and avoids undertraining.
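This per-group counting can be sketched as follows. The opcode names and the four group labels here are illustrative stand-ins; the paper's actual grouping of instruction types is defined by functionality and is not enumerated in this excerpt:

```python
from collections import Counter

# Illustrative grouping of LLVM-style opcodes into four coarse classes;
# the paper's real four groups are chosen for their discriminative power.
OPCODE_GROUP = {
    "load": "memory", "store": "memory",
    "add": "arithmetic", "fadd": "arithmetic", "mul": "arithmetic",
    "br": "control", "icmp": "control",
    "and": "logic", "shl": "logic",
}

def instruction_type_features(trace):
    """Count dynamic instruction instances per group from an opcode trace."""
    counts = Counter(OPCODE_GROUP.get(op, "other") for op in trace)
    return {g: counts[g] for g in ("memory", "arithmetic", "control", "logic")}

trace = ["load", "add", "add", "store", "icmp", "br"]
print(instruction_type_features(trace))
# {'memory': 2, 'arithmetic': 2, 'control': 2, 'logic': 0}
```

Grouping opcodes rather than keeping one feature per opcode is what keeps the feature count below the number of training applications, as required to avoid undertraining.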

Second, we count resilience computation patterns as features. Guo et al. [32] discovered six resilience computation patterns in HPC applications; those patterns are considered the fundamental reason for application resilience. Four of the patterns are based on individual instructions and can be included as features using the instruction type-based approach above. The remaining two ("dead locations" and "repeated addition") involve more than one instruction and cannot be captured by examining instructions individually. To count these two patterns efficiently, we introduce optimization techniques that avoid repeatedly scanning the instruction trace to find correlations between instructions.
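As a rough illustration of a single-pass count, consider the "dead location" pattern: a value written to a location is overwritten before it is ever read, so a fault in it cannot propagate. The trace encoding and the pattern definition below are simplified assumptions for illustration, not the paper's actual trace format or algorithm:

```python
def count_dead_locations(trace):
    """Count 'dead location' instances in a single pass over a trace of
    ("load"/"store", address) tuples: a store whose value is overwritten
    by a later store to the same address with no intervening load."""
    unread_store = {}  # address -> True while its latest store is unread
    dead = 0
    for op, addr in trace:
        if op == "store":
            if unread_store.get(addr):  # prior store was never read: dead
                dead += 1
            unread_store[addr] = True
        else:  # a load consumes the pending store at this address
            unread_store[addr] = False
    return dead

trace = [("store", 0x10), ("store", 0x10),              # first store dies
         ("store", 0x20), ("load", 0x20), ("store", 0x20)]
print(count_dead_locations(trace))  # 1
```

The point of the single-pass bookkeeping is exactly the optimization mentioned above: the trace is never re-scanned to correlate each store with later accesses.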

Third, we introduce instruction execution order information into the features to improve modeling accuracy. Execution order matters to application resilience because error propagation is highly correlated with the order and type of operations. Inspired by the "N-gram" technique [16], [56] from computational linguistics, we embed the sequence of instruction chunks into features to capture the execution order of instructions. Our evaluation shows that including execution order information decreases prediction error by up to 30%.
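The chunk-sequence idea can be sketched like this: cut the dynamic trace into fixed-size chunks and count adjacent chunk pairs (bigrams), so that the same instruction mix in a different order yields different features. The chunk size and this exact encoding are illustrative choices, not the paper's specification:

```python
from collections import Counter

def chunk_ngram_features(trace, chunk_size=2):
    """Bigram counts over fixed-size instruction chunks. Adjacent chunk
    pairs become features, encoding execution order in the spirit of
    N-grams in computational linguistics."""
    chunks = [tuple(trace[i:i + chunk_size])
              for i in range(0, len(trace), chunk_size)]
    return Counter(zip(chunks, chunks[1:]))

trace = ["load", "add", "store", "add", "load", "add"]
feats = chunk_ngram_features(trace)
# Two ordered chunk pairs: (load,add)->(store,add) and (store,add)->(load,add)
```

A plain opcode histogram would treat any permutation of this trace identically; the bigram features distinguish them, which is precisely the order information the model needs.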

Fourth, we introduce a resilience weight when counting instruction instances. Different instruction instances, even of the same instruction type, can have different capabilities to tolerate faults; the resilience weight quantifies this difference. Introducing the resilience weight decreases prediction error by 13% on average when predicting the rate of a particular fault manifestation (the interruption rate).
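The effect on the feature vector is that a group's feature becomes a weighted sum rather than a raw count. The per-instance weights below are made-up numbers for illustration; how PARIS derives each instance's weight is not described in this excerpt:

```python
def weighted_instance_count(instances):
    """Weighted count of instruction instances: each (opcode, weight)
    pair contributes its resilience weight instead of 1, so instances
    that mask faults more often contribute more to the feature."""
    return sum(weight for _opcode, weight in instances)

# Three 'add' instances with different (hypothetical) resilience weights:
adds = [("add", 0.9), ("add", 0.3), ("add", 0.6)]
print(round(weighted_instance_count(adds), 1))  # 1.8, vs. a raw count of 3
```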

Based on the above features, we use feature selection techniques to rank and further reduce the features. We perform an ablation study to understand the sensitivity of prediction accuracy to each feature. We reveal the significance of memory-related instructions and data overwriting to application resilience.
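As a minimal stand-in for this selection step, one can rank features by their correlation with the resilience target and keep the top few. This is only a sketch of the general idea; the paper's actual selection techniques (and its ablation protocol) are not specified in this excerpt:

```python
import numpy as np

def rank_features(X, y):
    """Rank feature columns of X by absolute Pearson correlation with
    the target y (e.g., a fault-manifestation rate), highest first."""
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(range(X.shape[1]), key=lambda j: corrs[j], reverse=True)

# Synthetic example: 100 'applications', 3 features, target driven by
# feature 1 only, so it should rank first.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)
print(rank_features(X, y)[0])  # 1
```

An ablation study then re-trains the model with each (group of) feature(s) removed and compares prediction error, which is how the significance of memory-related instructions was established.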

In summary, our contributions are as follows.

  • We present PARIS, a machine learning-based approach to predict application resilience. Our method breaks the fundamental tradeoff between evaluation speed and accuracy in the existing common practice to estimate application resilience.

  • We develop a framework and overcome a series of technical challenges for feature construction, extraction and selection. We reveal how to use machine learning to effectively and efficiently model application resilience.

  • We test our model on 16 benchmarks. We find that our approach is up to 450x faster than random FI (49x on average). The model has high prediction accuracy: average prediction errors of 8.5% for success rate and 22% for interruption rate (excluding two obvious outliers). We compare PARIS with Trident [46] (the state-of-the-art analytical model): PARIS can predict any fault manifestation rate (SDC, interruptions, and success), while Trident only predicts SDC rate; PARIS is at least 63% better than Trident in terms of accuracy for predicting SDC rate, and has comparable execution time (it is faster for 12 of the 16 benchmarks, with a 15x speedup on average).

Section snippets

Fault model

We consider transient faults in the computation units of processors, for example, faults in the Arithmetic Logic Unit (ALU) and in the address computation for loads and stores. We do not consider transient faults in memory components, such as caches, because these components are usually protected by Error Correcting Code (ECC) or parity at the architecture level. Similar assumptions are made in existing work [46], [63].

Furthermore, we consider a single bit-flip model, not multiple bit-flip

Overview

We give a high-level overview of PARIS. Fig. 1 depicts the workflow of the training process of PARIS. The most challenging part of the training process is to construct features relevant to application resilience that can produce high modeling accuracy.

Feature Construction. We use the instruction type and the number of instruction instances of each type as features. A static instruction in a program has an instruction type (opcode) and can be executed many times, each of which is an instruction

Feature construction

For feature construction, we have the following requirements: (1) features should be relevant to application resilience; (2) the number of features should be small enough (smaller than the number of applications used for training) to avoid under-determination of the model; (3) we should avoid redundant and irrelevant features since these features can increase prediction error. Following the above requirements, we introduce instructions, resilience computation patterns, resilience weight, and

Implementation

Dataset Construction. We have multiple requirements for creating the training and testing datasets. (1) The training dataset must be large enough to avoid model underdetermination; (2) the applications used to generate the training and testing datasets must have diverse computation and resilience characteristics; (3) those applications must have explicit result verification phases. Having a verification phase allows us to determine the fault manifestations.

Testing

Evaluation

We evaluate our model and modeling methods from two perspectives: (1) modeling accuracy; (2) contributions of features and optimization techniques to modeling accuracy.

To evaluate modeling accuracy, we calculate MAPE for the predicted success, SDC, and interruption rates compared with the ground-truth rates measured by performing FI campaigns. We use PINFI [63] for FI. For each program in our dataset, we perform an FI campaign of 3000 random fault injections, following the statistical
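MAPE (mean absolute percentage error), the accuracy metric named above, is straightforward to compute against the FI-measured ground truth; the rates in the example are invented for illustration:

```python
def mape(predicted, ground_truth):
    """Mean absolute percentage error between predicted rates and
    FI-measured ground-truth rates (cf. De Myttenaere et al., cited in
    the references)."""
    return 100.0 * sum(abs(p - g) / abs(g)
                       for p, g in zip(predicted, ground_truth)) / len(ground_truth)

# e.g. predicted success rates vs. FI ground truth for three programs:
print(round(mape([0.50, 0.40, 0.90], [0.50, 0.50, 1.00]), 1))  # 10.0
```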

Discussions

Use of PARIS. To use PARIS, the user only needs to train the prediction model once, and then the trained model can be repeatedly used for predicting error resilience of any application. Predicting application resilience is useful for improving application resilience [14], [32] and optimizing fault tolerance mechanisms [20], [39], [46], [65]. To train the prediction model, the user must follow the training workflow in Fig. 1. Given a new application, the user needs to generate a dynamic

Related work

Using Machine Learning to Address Resilience Problems. Recent research has started to use ML to address resilience problems [3], [20], [28], [30], [39], [44], [52]. Laguna et al. [44] train an ML classifier, IPAS, which learns which instructions have a high likelihood of leading to silent output corruption. Desh [20] predicts node failures by training a recurrent neural network model on system logs. Nie et al. [52] use system logs to predict the future occurrence of GPU errors. PRISM [39]

Conclusions

Understanding application resilience to errors is increasingly important for ensuring result correctness in HPC applications. The traditional method (FI) for understanding application resilience is too expensive; analytical models are faster but not as accurate as FI. This paper introduces PARIS, a new ML-based solution to these problems. We discuss feature construction, extraction, and selection, which are the keys to enabling high-performance ML for predicting application

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Karthik Pattabiraman, University of British Columbia

Acknowledgments

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-766363). This research was partially supported by U.S. National Science Foundation (CNS-1617967, CCF-1553645 and CCF-1718194).

Dr. Luanzheng Guo is a postdoctoral researcher at the Pacific Northwest National Laboratory, working with the HPC Group at the intersection of HPC and machine learning. He obtained his Ph.D. in Electrical Engineering and Computer Science from the University of California, Merced in 2020. His Ph.D. research focused on system resilience and reliability in large-scale parallel HPC systems.

References (67)

  • De Myttenaere, A., et al., Mean absolute percentage error for regression models, Neurocomputing, 2016.
  • Coral Benchmark Codes [online],...
  • Aktulga, H.M., et al., Parallel reactive molecular dynamics: Numerical methods and algorithmic techniques, Parallel Comput., 2012.
  • Ashraf, R., et al., Understanding the propagation of transient errors in HPC applications, SC, 2015.
  • Bailey, D.H., et al., NAS Parallel benchmark results, SC, 1992.
  • Battiti, R., Using mutual information for selecting features in supervised neural net learning, IEEE Trans. Neural Netw., 1994.
  • Baumann, R.C., Radiation-induced soft errors in advanced semiconductor technologies, IEEE Trans. Device Mater. Reliab., 2005.
  • Bergstra, J., et al., Random search for hyper-parameter optimization, JMLR, 2012.
  • Bienia, C., et al., The PARSEC benchmark suite: Characterization and architectural implications, PACT, 2008.
  • Bradley, P.S., et al., Feature selection via concave minimization and support vector machines, ICML, 1998.
  • Cai, L., et al., Probabilistic wind power forecasting approach via instance-based transfer learning embedded gradient boosting decision trees, Energies, 2019.
  • Calhoun, J., Olson, L., Snir, M., FlipIt: An LLVM based fault injector for HPC, in: Workshops in Euro-Par,...
  • Calhoun, J., et al., Towards a more complete understanding of SDC propagation, HPDC, 2017.
  • Cappello, F., Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities, Int. J. High Perform. Comput. Appl., 2009.
  • Casas, M., et al., Fault resilience of the multi-grid solver, ICS, 2012.
  • Che, S., et al., Rodinia: A benchmark suite for heterogeneous computing, IISWC, 2009.
  • Chen, X., et al., Gated recursive neural network for Chinese word segmentation, ACL, 2015.
  • Cher, C.-Y., et al., Understanding soft error resiliency of BlueGene/Q compute chip through hardware proton irradiation and software fault injection, SC, 2014.
  • Choubin, B., et al., Multiple linear regression, multi-layer perceptron network and adaptive neuro-fuzzy inference system for forecasting precipitation based on large-scale climate signals, Hydrol. Sci. J., 2016.
  • Coates, A., et al., An analysis of single-layer networks in unsupervised feature learning, AISTATS, 2011.
  • Das, A., et al., Desh: Deep learning for system health prediction of lead times to failure in HPC, HPDC, 2018.
  • De Oliveira, D.A.G., et al., Radiation-induced error criticality in modern HPC parallel accelerators, HPCA, 2017.
  • Deng, J., et al., ImageNet: A large-scale hierarchical image database.
  • Domingos, P., Bayesian averaging of classifiers and the overfitting problem, ICML, 2000.
  • Drucker, H., et al., Boosting and other ensemble methods, Neural Comput., 1994.
  • Egwutuoha, I.P., et al., A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, J. Supercomput., 2013.
  • Friedman, J.H., Stochastic gradient boosting, Comput. Stat. Data Anal., 2002.
  • Georgakoudis, G., et al., Reinit++: Evaluating the performance of global-restart recovery methods for MPI fault tolerance, ISC, 2020.
  • Georgakoudis, G., et al., REFINE: Realistic fault injection via compiler-based instrumentation for accuracy, portability and speed, SC, 2017.
  • Guo, L., et al., MATCH: An MPI fault tolerance benchmark suite.
  • Guo, L., Li, D., MOARD: Modeling application resilience to transient faults on data objects, in: International Parallel...
  • Guo, L., et al., FlipTracker: Understanding natural error resilience in HPC applications, SC, 2018.
  • Guyon, I., et al., An introduction to variable and feature selection, JMLR, 2003.
Dr. Dong Li is an associate professor in the Department of Electrical Engineering and Computer Science, University of California, Merced. He is the director of the Parallel Architecture, System, and Algorithm Lab (PASA).

Dr. Ignacio Laguna is a Computer Scientist at the Center for Applied Scientific Computing (CASC) at the Lawrence Livermore National Laboratory (LLNL), California. His main area of research is high-performance computing (HPC); his main sub-area of research in HPC is programming models and systems. He is in particular interested in fault tolerance, fault resilience, debugging, software correctness and general software reliability.
