Boundary sampling to boost mutation testing for deep learning models

https://doi.org/10.1016/j.infsof.2020.106413

Abstract

Context: The prevalent application of Deep Learning (DL) models has raised concerns about their reliability. Due to the data-driven programming paradigm, the quality of test datasets is extremely important for gaining an accurate assessment of DL models. Recently, researchers have introduced mutation testing into DL testing: mutation operators are applied to generate mutants from DL models, and the quality of a test dataset is checked by observing whether its test data can identify the mutants. However, several factors (e.g., huge labeling effort and high running cost) still hinder the implementation of mutation testing for DL models.

Objective: We seek an approach that selects a smaller, sensitive, representative, and efficient subset of the whole test dataset to promote current mutation testing (e.g., by reducing labeling and running costs) for DL models.

Method: We propose boundary sample selection (BSS), which employs the distance of samples to the decision boundary of DL models as the indicator for constructing an appropriate subset. To evaluate the performance of BSS, we conduct an extensive empirical study with two widely used datasets, three popular DL models, and 14 up-to-date DL mutation operators.

Results: We observe that (1) the subsets generated by BSS are much smaller (about 3%-20% of the whole test set); (2) under most mutation operators, our subsets are superior to the whole test sets in observing mutation effects (by about 9.94-21.63); (3) our subsets can replace the whole test sets to a very high degree (higher than 97%) when considering mutation score; and (4) the MRR values of our proposed subsets are clearly better (about 2.28-13.19 times higher) than those of the whole test sets.

Conclusions: The results show that BSS can help testers save labeling cost, run mutation testing quickly, and identify killed mutants early.

Introduction

Deep Learning (DL) models compose simple but non-linear modules across the multiple layers of deep neural networks (DNNs) to construct increasingly abstract representations of raw inputs, aiming to amplify the input dimensions that are important for classification tasks [1]. Massive labeled training data [2] and the explosion of computing power [3] have accelerated the development and popularization of DL techniques, which have brought about breakthroughs in many areas, including speech recognition [4], image recognition [5], natural language processing [6], and so on.

The prevalent application of DL models has raised concerns about the quality and reliability of the models, especially in safety-critical domains (e.g., disease diagnosis [7] and autonomous driving [8]). Software testing furnishes an objective evaluation of software quality by comparing the required and actual behavior of software systems [9], and has evolved into one of the mainstream technologies for DL testing, exposing defects and improving the reliability of DL models [10], [11], [12], [13], [14].

Due to the data-driven programming paradigm of DL models, evaluating the test power of the test dataset is one of the most important parts of DL testing [15]. In DL testing, testers benefit from knowing the power of the test suite in detecting defects of DL models, since a test suite with stronger power is more likely to detect defects [16]. For traditional software, mutation testing is a methodology to systematically check the test power of a test suite: it employs mutation operators to generate faulty versions (i.e., mutants) of the software and checks to what extent the test suite can detect the faults in these mutants [17]. Previous results on traditional software have shown that mutation testing has the best performance in terms of fault revelation [18]. Recently, researchers introduced mutation testing into DL testing by designing and implementing mutation operators for DL models. Shen et al. proposed MuNN, a mutation analysis method for DL models, which consists of five mutation operators to construct mutants of DL models [19]. Ma et al. proposed a mutation testing framework named DeepMutation to measure the quality of test suites, including a set of source-level and model-level mutation operators [15]. Wang et al. proposed a method to detect adversarial samples for DL models through mutation testing [20].
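To make the mechanics concrete, the following is a minimal, hypothetical illustration of traditional mutation testing in Python (the function and tests are ours, not from the paper): a mutant is created by flipping an arithmetic operator, and a test suite kills the mutant if some test that passes on the original fails on the mutant.

    # Program under test.
    def area(w, h):
        return w * h

    # A mutant produced by an arithmetic-operator mutation (* replaced by +).
    def area_mutant(w, h):
        return w + h

    def kills(mutant, tests):
        # A mutant is killed if some test (which the original passes) fails on it.
        return any(mutant(*args) != expected for args, expected in tests)

    weak_suite = [((2, 2), 4)]                  # 2 + 2 == 2 * 2: mutant survives
    strong_suite = [((2, 2), 4), ((2, 3), 6)]   # 2 + 3 != 6: mutant is killed

    print(kills(area_mutant, weak_suite))    # False
    print(kills(area_mutant, strong_suite))  # True

The stronger suite earns a higher mutation score (the fraction of mutants killed), which is exactly the sense in which mutation testing measures test power.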

Currently, mutation testing for DL models is still at an early stage. After analyzing the results and findings of prior studies, we find out that there exist the following three factors hindering the implementation of mutation testing for DL models:

  • Oracle workload. The oracle refers to the label information of test data in DL testing. For testers, the primary challenge is that labeling test data requires huge human labor. For example, as reported, a worker skilled in labeling pictures can only handle about 40 pictures per day on average.

  • Running cost. Mutation testing can be very expensive, as has already been discussed for traditional software [21]. It requires generating mutants based on the mutation operators and executing the mutants against the whole test data. Compared with traditional software, the amounts of training and test data have grown very large with the development of DL models, which makes executing mutants against large-scale test data very time-consuming.

  • Mutant sensitivity. Different from traditional software, DL mutation operators involve randomness and uncertainty (e.g., adding Gaussian noise to the weight values of a network; see the sketch after this list), so we need to observe the mutation effects to assess the validity of the operators. The results of previous studies [15], [19] showed that most inputs from the test data are not sensitive to DL mutants, i.e., the outputs of the mutants on most inputs are the same as those of the original models, which makes it difficult to observe the difference (e.g., in terms of test accuracy) between the original models and their mutants.
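As an illustration of a model-level mutation operator of the kind just mentioned, the sketch below perturbs every weight of a trained model with Gaussian noise, in the spirit of the Gaussian fuzzing operator from DeepMutation [15]. This is a minimal sketch under our own assumptions (a Keras model, an illustrative noise scale sigma), not the authors' implementation; the seed parameter makes the operator's inherent randomness reproducible.

    import numpy as np
    from tensorflow import keras

    def gaussian_fuzz_mutant(model, sigma=0.1, seed=None):
        # Clone the architecture, then add zero-mean Gaussian noise to
        # every weight tensor of the original model (illustrative sketch).
        rng = np.random.default_rng(seed)
        mutant = keras.models.clone_model(model)
        mutant.set_weights([
            w + rng.normal(0.0, sigma, size=w.shape).astype(w.dtype)
            for w in model.get_weights()
        ])
        return mutant

Because the noise is random, each call with a different seed yields a different mutant, which is why mutation effects must be observed empirically rather than assumed.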

To address the issues above, testers desire an approach that promotes current mutation testing activities in DL testing by reducing oracle workload, decreasing running cost, and increasing mutant sensitivity. Motivated by this purpose, we focus on the following problem: can we obtain the results of mutation testing on DL models by selecting a smaller, sensitive, representative, and efficient subset of the whole test data to implement mutation testing? We propose Boundary Sample Selection (BSS) as an affirmative answer to this question.

To the best of our knowledge, BSS is the first DL mutation testing technique that selects a subset of the whole test data to implement mutation testing. BSS only requires the output probability values, not the real labels. Specifically, BSS is based on the intuition that samples near the decision boundary of the original DL model are more sensitive to the effects of mutation operations. This inspires us to employ the distance of samples to the decision boundary as the indicator for selecting the subset. To measure the distance, we follow the idea that the more easily a sample can be misclassified, the closer to the boundary it should be (see the details in Section 3). In summary, instead of evaluating test power with the whole test data, BSS selects an appropriate subset near the decision boundary of the DL model, which cuts down the labeling workload and running cost, and makes it easier for testers to observe the effects of mutation operations.
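Section 3 gives the precise distance definition; as a rough sketch of the intuition, one plausible proxy (our assumption for illustration, not necessarily the paper's exact formula) is the gap between the top two predicted class probabilities: the smaller the gap, the more easily the sample is misclassified, hence the closer it sits to the decision boundary. Note that only the model's output probabilities are needed, not ground-truth labels.

    import numpy as np

    def select_boundary_subset(model, x_test, threshold=0.2):
        # Predicted class probabilities, shape (n_samples, n_classes).
        probs = model.predict(x_test)
        # Gap between the largest and second-largest probability per sample.
        top2 = np.sort(probs, axis=1)[:, -2:]
        gap = top2[:, 1] - top2[:, 0]
        # Keep the samples whose gap falls below the threshold, i.e.,
        # those nearest the decision boundary (illustrative criterion).
        boundary_idx = np.where(gap < threshold)[0]
        return x_test[boundary_idx], boundary_idx

Only the selected samples would then need manual labels, which is where the savings in labeling workload and running cost come from.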

Fig. 1 gives a brief framework of the flow of activities performed to evaluate the performance of BSS. First, we train three original models from the datasets and DL model structures; BSS is evaluated on two widely used datasets, MNIST [22] and CIFAR-10 [23], under three popular DL model structures [24], [25], [26]. Next, we apply mutation operators (14 up-to-date mutation operators [15], [19], from both the source level and the model level) to generate mutated models; for each DL model, we construct more than 300 mutants. Finally, we evaluate the performance of BSS in mutation testing for DL models through four research questions. Based on the original models, boundary samples are identified by their distance to the decision boundary, and a threshold is set to select the nearest samples from the whole test set. We introduce a set of metrics (e.g., normal and stronger mutation score) to compare the results of running mutation testing on the subsets generated by BSS and on the whole test data. To examine whether BSS performs well in DL mutation testing, we organize our experiments around the following four research questions (RQs):

  • RQ1 (Small size): are the sizes of the subsets generated by BSS much smaller than the whole test data?

    Our experiments show that the subsets generated by BSS are much smaller (about 3% to 20% of the whole test data), with stable sizes (coefficients of variation lower than or equal to 0.1).

  • RQ2 (Sensitivity): are the subsets generated by BSS more sensitive to mutation operations of DL models?

    We compare the performance of the boundary samples with that of the whole test set in observing the performance fluctuation caused by mutation, using the difference in accuracy (between original and mutated models) as the performance indicator. The experimental results show that the subsets generated by BSS are superior in observing the effects of mutation operators: under most mutation operators, the subsets outperform the whole test set (by 9.94 to 21.63).

  • RQ3 (Representativeness): are the subsets generated by BSS representative enough to evaluate the results of mutation testing under the whole test data?

    In this RQ, we apply a widely used indicator (i.e., mutation score) to evaluate whether the subsets generated by BSS can substitute for the whole test set in DL mutation testing (a sketch of the metrics follows this list). We observe that in our experiments the subsets generated by BSS can replace the whole test set to a very high degree when considering the normal mutation score (close to 100%) and the stronger mutation score (higher than 97%).

  • RQ4 (Efficiency): are the subsets generated by BSS more efficient than the whole test set under DL mutation testing?

    To analyze efficiency, we apply the Mean Reciprocal Rank (MRR), an evaluation metric widely used for retrieval algorithms (also sketched after this list). The results show that the MRR values of our proposed subsets are clearly better (on average, 8.30, 13.19, and 2.28 times greater for the three DL models, respectively) than those of the whole test set.
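For reference, the sketch below shows plausible forms of the metrics named in RQ3 and RQ4 (see the pointers in those items). We hedge both definitions as our reading of common usage, not the paper's exact formulas: a mutant counts as killed when some input that the original model classifies correctly is misclassified by the mutant, and MRR averages the reciprocal rank of the first killing input per mutant, so a higher MRR means killed mutants are identified earlier. The paper's precise definitions (including the stronger mutation score) appear in Sections 4 and 5.

    import numpy as np

    def mutation_score(orig_pred, mutant_preds, labels):
        # Fraction of mutants killed by the test set (illustrative form).
        correct = orig_pred == labels
        killed = [np.any(correct & (p != orig_pred)) for p in mutant_preds]
        return sum(killed) / len(mutant_preds)

    def mean_reciprocal_rank(orig_pred, mutant_preds, labels):
        # Average of 1 / rank of the first input that kills each mutant;
        # an unkilled mutant contributes 0 (our assumed convention).
        correct = orig_pred == labels
        rr = []
        for p in mutant_preds:
            kill_pos = np.where(correct & (p != orig_pred))[0]
            rr.append(1.0 / (kill_pos[0] + 1) if kill_pos.size else 0.0)
        return float(np.mean(rr))

A smaller subset that still kills the same mutants keeps the mutation score intact while moving the first killing inputs toward the front of the run, which is the sense in which BSS improves MRR.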

The main contributions of this paper are summarized as follows:

  • We put forward Boundary Sample Selection (BSS), the first DL mutation testing technique that selects a subset of the whole test set to implement mutation testing, which can optimize the process by reducing labeling and running cost and improving mutant sensitivity.

  • We conduct an empirical study with 14 popular mutation operators and nearly 5000 mutants of DL models to check the performance of BSS. The results of our experiments show that BSS is very useful in mutation testing, i.e., the subsets generated by BSS are much smaller, more sensitive, almost completely representative, and more efficient than the whole test set.

The rest of our paper is organized as follows. The background and related work are presented in Section 2. In Section 3, we define the boundary sample and describe Boundary Sample Selection in detail. The experimental settings are introduced in Section 4, including studied datasets, subject models, mutation operators, research questions, and so on. The experimental results and findings are explained in Section 5. Sections 6 and 7 further discuss some important experimental details and threats to validity of our study, respectively. Section 8 concludes our paper.

Section snippets

Background and related work

In this section, we elaborate on the basic structure of deep neural networks and summarize related studies along three threads: deep neural networks, mutation testing, and DL testing.

Our approach

Before describing Boundary Sample Selection (BSS), we state our problem more formally:

Problem. Given a trained original DL model M_o, its set S_Mm of mutant models S_Mm = {M_m1, M_m2, ..., M_mk}, and an available test set T of test data samples, instead of manually labeling and running all samples of T on S_Mm, we aim to select a subset T* such that labeling and running only the samples of T* yields a result that can replace the result of the whole set T.
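Read as an explicit selection criterion, with d and ε as symbols we introduce for illustration (the precise distance is defined later in this section), the problem can be written as:

    T^{*} = \{\, x \in T : d(x, \partial M_o) \le \varepsilon \,\}, \qquad |T^{*}| \ll |T|,

where d(x, ∂M_o) denotes the distance of sample x to the decision boundary of the original model M_o and ε is the selection threshold.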

In this section, we will describe the idea and the details of our BSS technique to

Experiment setup

In this section, we present the experiment setup for conducting mutation testing on DL models, which aims to evaluate the performance of the proposed BSS. Our experiment includes the following steps: the generation of original models from datasets (Section 4.1) and DL model structures (Section 4.2), the generation of mutants from mutation operators (Section 4.3), and the experimental evaluation (Section 4.4). To overcome the randomness of DL models, we have repeated the experiments five times.

Experiment results

In this section, we specify our research questions, along with our motivation, approach, results, and findings. To overcome the randomness of DL models, we have repeated the experiments five times. All the experimental results below are composite results of five repetitions. In each RQ, we will introduce how to summarize the results of these repeated experiments in detail to eliminate the randomness.

On the whole, the performance evaluation parts of our experiments (RQ2, RQ3, and RQ4) are

Discussion

In this section, to better understand our proposed BSS, we further discuss four aspects: the performance under different thresholds, the sensitivity of the whole test sets, the dissection of boundary samples, and the benefits of BSS.

Threats to validity

In this section, we discuss the threats to validity of our study from two aspects: external threats and internal threats.

Conclusion

For traditional software, mutation testing is a methodology to systematically check the quality of test datasets. Recently, many researchers have introduced mutation testing into DL testing by designing specific mutation operators for DL models. However, mutation testing for DL models is still at an early stage. After analyzing prior studies, we find three hindering factors: oracle workload, running cost, and mutant sensitivity.

We put forward Boundary Sample Selection

CRediT authorship contribution statement

Weijun Shen: Methodology, Writing - original draft, Software. Yanhui Li: Conceptualization, Methodology, Writing - original draft. Yuanlei Han: Software. Lin Chen: Methodology, Writing - review & editing. Di Wu: Software. Yuming Zhou: Methodology, Writing - review & editing. Baowen Xu: Methodology.

Declaration of Competing Interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "Boundary Sampling to Boost Mutation Testing for Deep Learning Models".

Acknowledgements

The work is supported by the National Key R&D Program of China (Grant No. 2018YFB1003901) and the National Natural Science Foundation of China (Grant Nos. 61932012, 61872177, 61832009, 61772263, and 61772259). We thank the anonymous referees for their helpful comments on this paper.

References (84)

  • L. Deng et al., Mutation operators for testing Android apps, Inf. Softw. Technol. (2017)
  • B. Aziz, Towards a mutation analysis of IoT protocols, Inf. Softw. Technol. (2018)
  • M. Wen et al., Exposing library API misuses via mutation analysis, Proceedings of the 41st International Conference on Software Engineering (2019)
  • F. Wu et al., Memory mutation testing, Inf. Softw. Technol. (2017)
  • M. Gligoric et al., Selective mutation testing for concurrent code, Proceedings of the 2013 International Symposium on Software Testing and Analysis (2013)
  • D. Gong et al., Mutant reduction based on dominance relation for weak mutation testing, Inf. Softw. Technol. (2017)
  • Y. Jia et al., Constructing subtle faults using higher order mutation testing, 2008 Eighth IEEE International Working Conference on Source Code Analysis and Manipulation (2008)
  • M. Sahinoglu et al., A Bayes sequential statistical procedure for approving software products, Proceedings of the IFIP Conference on Approving Software Products (ASP90) (1990)
  • Y.-S. Ma et al., MuJava: an automated class mutation system, Software Testing, Verification and Reliability (2005)
  • O. Banias, Test case selection-prioritization approach based on memoization dynamic programming algorithm, Inf. Softw. Technol. (2019)
  • A. Arrieta et al., Pareto efficient multi-objective black-box test case selection for simulation-based testing, Inf. Softw. Technol. (2019)
  • V. Garousi et al., Multi-objective regression test selection in practice: an empirical study in the defense software industry, Inf. Softw. Technol. (2018)
  • M. Bures et al., Employment of multiple algorithms for optimal path-based test selection strategy, Inf. Softw. Technol. (2019)
  • J. Hamidzadeh et al., IRAHC: instance reduction algorithm using hyperrectangle clustering, Pattern Recognit. (2015)
  • J.M. Zhang et al., Machine learning testing: survey, landscapes and horizons, IEEE Trans. Softw. Eng. (2020)
  • Y. LeCun et al., Deep learning, Nature (2015)
  • A. Geiger, Are we ready for autonomous driving? The KITTI vision benchmark suite, IEEE Conference on Computer Vision and Pattern Recognition (2012)
  • O. Abdel-Hamid et al., Convolutional neural networks for speech recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing (2014)
  • C.C. Dan et al., Flexible, high performance convolutional neural networks for image classification, International Joint Conference on Artificial Intelligence (IJCAI) (2011)
  • I. Sutskever et al., Sequence to sequence learning with neural networks, Advances in Neural Information Processing Systems (2014)
  • E. Choi et al., Using recurrent neural network models for early detection of heart failure onset, Journal of the American Medical Informatics Association (JAMIA) (2016)
  • A. Taeihagh and H.S.M. Lim, Governing autonomous vehicles: emerging responses for safety, liability, privacy, ...
  • P. Ammann et al., Introduction to Software Testing (2016)
  • C. Szegedy et al., Intriguing properties of neural networks, arXiv preprint arXiv:1312.6199 (2013)
  • K. Pei et al., DeepXplore: automated whitebox testing of deep learning systems, Proceedings of the 26th Symposium on Operating Systems Principles (2017)
  • L. Ma et al., DeepCT: tomographic combinatorial testing for deep learning systems, 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER) (2019)
  • X. Yuan et al., Adversarial examples: attacks and defenses for deep learning, IEEE Trans. Neural Netw. Learn. Syst. (2019)
  • Y. Sun et al., Concolic testing for deep neural networks, Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (2018)
  • L. Ma et al., DeepMutation: mutation testing of deep learning systems, 2018 IEEE 29th International Symposium on Software Reliability Engineering (ISSRE) (2018)
  • J. Zhang et al., Predictive mutation testing, IEEE Trans. Softw. Eng. (2018)
  • J.H. Andrews et al., Is mutation an appropriate tool for testing experiments?, Proceedings of the 27th International Conference on Software Engineering (ICSE 2005) (2005)
  • T.T. Chekam et al., An empirical study on mutation, statement and branch coverage fault revelation that avoids the unreliable clean program assumption, 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE) (2017)