Boundary sampling to boost mutation testing for deep learning models

https://doi.org/10.1016/j.infsof.2020.106413

Abstract

Context: The prevalent application of Deep Learning (DL) models has raised concerns about their reliability. Due to the data-driven programming paradigm, the quality of test datasets is extremely important for gaining an accurate assessment of DL models. Recently, researchers have introduced mutation testing into DL testing: mutation operators are applied to generate mutants from DL models, and the quality of a test dataset is checked by observing whether its test data can identify the mutants. However, several factors (e.g., huge labeling effort and high running cost) still hinder the implementation of mutation testing for DL models.

Objective: We seek an approach that selects a smaller, sensitive, representative, and efficient subset of the whole test dataset to promote current mutation testing (e.g., by reducing labeling and running costs) for DL models.

Method: We propose boundary sample selection (BSS), which employs the distance of samples to the decision boundary of DL models as the indicator for constructing an appropriate subset. To evaluate the performance of BSS, we conduct an extensive empirical study with two widely used datasets, three popular DL models, and 14 up-to-date DL mutation operators.

Results: We observe that (1) the subsets generated by BSS are much smaller (about 3%-20% of the whole test set); (2) under most mutation operators, our subsets are superior to the whole test sets in observing mutation effects (by about 9.94-21.63); (3) our subsets can replace the whole test sets to a very high degree (higher than 97%) when considering mutation score; and (4) the MRR values of our proposed subsets are clearly better (about 2.28-13.19 times higher) than those of the whole test sets.

Conclusions: The results show that BSS can help testers save labeling cost, run mutation testing quickly, and identify killed mutants early.

Introduction

Deep Learning (DL) models compose simple but non-linear modules across the multiple layers of deep neural networks (DNNs) to construct increasingly abstract representations of raw inputs, aiming to amplify the input dimensions that are important for classification tasks [1]. Massive labeled training data [2] and the explosion of computing power [3] have accelerated the development and popularization of DL techniques, which have brought about breakthroughs in many areas, including speech recognition [4], image recognition [5], natural language processing [6], and so on.

The prevalent application of DL models has raised concerns about the quality and reliability of the models, especially in safety-critical domains (e.g., disease diagnosis [7] and autonomous driving [8]). Software testing furnishes an objective evaluation of software quality by comparing the required and actual behavior of software systems [9], and has evolved into one of the mainstream technologies for DL testing, exposing defects and improving the reliability of DL models [10], [11], [12], [13], [14].

Due to the data-driven programming paradigm of DL models, evaluating the test power of the test dataset is one of the most important parts of DL testing [15]. In DL testing, testers benefit from knowing the power of the test suite in detecting defects of DL models, since a test suite with stronger power is more likely to detect defects [16]. For traditional software, mutation testing is a methodology to systematically check the test power of a test suite: it employs mutation operators to generate faulty versions (i.e., mutants) of the software and checks to what extent the test suite can detect the faults in these mutants [17]. Previous results on traditional software have shown that mutation testing has the best performance in terms of fault revelation [18]. Recently, researchers introduced mutation testing into DL testing by designing and implementing mutation operators for DL models. Shen et al. proposed MuNN, a mutation analysis method for DL models, which consists of five mutation operators to construct mutants of DL models [19]. Ma et al. proposed a mutation testing framework named DeepMutation to measure the quality of test suites, including a set of source-level and model-level mutation operators [15]. Wang et al. proposed a method to detect adversarial samples for DL models through mutation testing [20].
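To make the mechanics concrete, the following is a minimal, hypothetical illustration of traditional mutation testing in Python (the function and tests are ours, not from the paper): a mutant is created by flipping an arithmetic operator, and a test suite kills the mutant if some test that passes on the original fails on the mutant.

    # Program under test.
    def area(w, h):
        return w * h

    # A mutant produced by an arithmetic-operator mutation (* replaced by +).
    def area_mutant(w, h):
        return w + h

    def kills(mutant, tests):
        # A mutant is killed if some test (which the original passes) fails on it.
        return any(mutant(*args) != expected for args, expected in tests)

    weak_suite = [((2, 2), 4)]                  # 2 + 2 == 2 * 2: mutant survives
    strong_suite = [((2, 2), 4), ((2, 3), 6)]   # 2 + 3 != 6: mutant is killed

    print(kills(area_mutant, weak_suite))    # False
    print(kills(area_mutant, strong_suite))  # True

The stronger suite earns a higher mutation score (the fraction of mutants killed), which is exactly the sense in which mutation testing measures test power.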

Currently, mutation testing for DL models is still at an early stage. After analyzing the results and findings of prior studies, we find out that there exist the following three factors hindering the implementation of mutation testing for DL models:

  • Oracle workload. The oracle refers to the label information of test data in DL testing. For testers, the primary challenge is that labeling test data requires huge human labor. For example, as reported, a worker skilled in labeling pictures can only handle about 40 pictures per day on average.

  • Running cost. Mutation testing can be very expensive, as has already been discussed for traditional software [21]. It requires generating mutants based on the mutation operators and executing the mutants against the whole test data. Compared with traditional software, the amounts of training and test data have grown very large with the development of DL models, which makes executing mutants against large-scale test data very time-consuming.

  • Mutant sensitivity. Different from traditional software, DL mutation operators involve randomness and uncertainty (e.g., adding Gaussian noise to the weight values of a network; see the sketch after this list), so we need to observe the mutation effects to assess the validity of the operators. The results of previous studies [15], [19] showed that most inputs from the test data are not sensitive to DL mutants, i.e., the outputs of the mutants on most inputs are the same as those of the original models, which makes it difficult to observe the difference (e.g., in terms of test accuracy) between the original models and their mutants.
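As an illustration of a model-level mutation operator of the kind just mentioned, the sketch below perturbs every weight of a trained model with Gaussian noise, in the spirit of the Gaussian fuzzing operator from DeepMutation [15]. This is a minimal sketch under our own assumptions (a Keras model, an illustrative noise scale sigma), not the authors' implementation; the seed parameter makes the operator's inherent randomness reproducible.

    import numpy as np
    from tensorflow import keras

    def gaussian_fuzz_mutant(model, sigma=0.1, seed=None):
        # Clone the architecture, then add zero-mean Gaussian noise to
        # every weight tensor of the original model (illustrative sketch).
        rng = np.random.default_rng(seed)
        mutant = keras.models.clone_model(model)
        mutant.set_weights([
            w + rng.normal(0.0, sigma, size=w.shape).astype(w.dtype)
            for w in model.get_weights()
        ])
        return mutant

Because the noise is random, each call with a different seed yields a different mutant, which is why mutation effects must be observed empirically rather than assumed.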

To address the issues above, testers desire an approach that promotes current mutation testing activities in DL testing by reducing oracle workload, decreasing running cost, and increasing mutant sensitivity. Motivated by this purpose, we focus on the following problem: can we obtain the results of mutation testing on DL models by selecting a smaller, sensitive, representative, and efficient subset of the whole test data to implement mutation testing? We propose Boundary Sample Selection (BSS) as an affirmative answer to this question.

To the best of our knowledge, BSS is the first DL mutation testing technique that selects a subset of the whole test data to implement mutation testing. BSS only requires the output probability values, not the real labels. Specifically, BSS is based on the intuition that samples near the decision boundary of the original DL model are more sensitive to the effects of mutation operations. This inspires us to employ the distance of samples to the decision boundary as the indicator for selecting the subset. To measure the distance, we follow the idea that the more easily a sample can be misclassified, the closer to the boundary it should be (see the details in Section 3). In summary, instead of evaluating test power with the whole test data, BSS selects an appropriate subset near the decision boundary of the DL model, which cuts down the labeling workload and running cost, and makes it easier for testers to observe the effects of mutation operations.
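Section 3 gives the precise distance definition; as a rough sketch of the intuition, one plausible proxy (our assumption for illustration, not necessarily the paper's exact formula) is the gap between the top two predicted class probabilities: the smaller the gap, the more easily the sample is misclassified, hence the closer it sits to the decision boundary. Note that only the model's output probabilities are needed, not ground-truth labels.

    import numpy as np

    def select_boundary_subset(model, x_test, threshold=0.2):
        # Predicted class probabilities, shape (n_samples, n_classes).
        probs = model.predict(x_test)
        # Gap between the largest and second-largest probability per sample.
        top2 = np.sort(probs, axis=1)[:, -2:]
        gap = top2[:, 1] - top2[:, 0]
        # Keep the samples whose gap falls below the threshold, i.e.,
        # those nearest the decision boundary (illustrative criterion).
        boundary_idx = np.where(gap < threshold)[0]
        return x_test[boundary_idx], boundary_idx

Only the selected samples would then need manual labels, which is where the savings in labeling workload and running cost come from.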

Fig. 1 gives a brief framework of the flow of activities performed to evaluate the performance of BSS. First, we train three original models from the datasets and DL model structures; BSS is evaluated on two widely used datasets, MNIST [22] and CIFAR-10 [23], under three popular DL model structures [24], [25], [26]. Next, we apply mutation operators (14 up-to-date mutation operators [15], [19], from both the source level and the model level) to generate mutated models; for each DL model, we construct more than 300 mutants. Finally, we evaluate the performance of BSS in mutation testing for DL models through four research questions. Based on the original models, boundary samples are identified by their distance to the decision boundary, and a threshold is set to select the nearest samples from the whole test set. We introduce a set of metrics (e.g., normal and stronger mutation score) to compare the results of running mutation testing on the subsets generated by BSS and on the whole test data. To examine whether BSS performs well in DL mutation testing, we organize our experiments around the following four research questions (RQs):

  • RQ1 (Small size): are the sizes of the subsets generated by BSS much smaller than the whole test data?

    Our experiments show that the subsets generated by BSS are much smaller (about 3% to 20% of the whole test data), with stable sizes (coefficients of variation lower than or equal to 0.1).

  • RQ2 (Sensitivity): are the subsets generated by BSS more sensitive to mutation operations of DL models?

    We compare the performance of the boundary samples with that of the whole test set in observing the performance fluctuation caused by mutation, using the difference in accuracy (between original and mutated models) as the performance indicator. The experimental results show that the subsets generated by BSS are superior in observing the effects of mutation operators: under most mutation operators, the subsets outperform the whole test set (by 9.94 to 21.63).

  • RQ3 (Representativeness): are the subsets generated by BSS representative enough to evaluate the results of mutation testing under the whole test data?

    In this RQ, we apply a widely used indicator (i.e., mutation score) to evaluate whether the subsets generated by BSS can substitute for the whole test set in DL mutation testing (a sketch of the metrics follows this list). We observe that in our experiments the subsets generated by BSS can replace the whole test set to a very high degree when considering the normal mutation score (close to 100%) and the stronger mutation score (higher than 97%).

  • RQ4 (Efficiency): are the subsets generated by BSS more efficient than the whole test set under DL mutation testing?

    To analyze efficiency, we apply the Mean Reciprocal Rank (MRR), an evaluation metric widely used for retrieval algorithms (also sketched after this list). The results show that the MRR values of our proposed subsets are clearly better (on average, 8.30, 13.19, and 2.28 times greater for the three DL models, respectively) than those of the whole test set.
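For reference, the sketch below shows plausible forms of the metrics named in RQ3 and RQ4 (see the pointers in those items). We hedge both definitions as our reading of common usage, not the paper's exact formulas: a mutant counts as killed when some input that the original model classifies correctly is misclassified by the mutant, and MRR averages the reciprocal rank of the first killing input per mutant, so a higher MRR means killed mutants are identified earlier. The paper's precise definitions (including the stronger mutation score) appear in Sections 4 and 5.

    import numpy as np

    def mutation_score(orig_pred, mutant_preds, labels):
        # Fraction of mutants killed by the test set (illustrative form).
        correct = orig_pred == labels
        killed = [np.any(correct & (p != orig_pred)) for p in mutant_preds]
        return sum(killed) / len(mutant_preds)

    def mean_reciprocal_rank(orig_pred, mutant_preds, labels):
        # Average of 1 / rank of the first input that kills each mutant;
        # an unkilled mutant contributes 0 (our assumed convention).
        correct = orig_pred == labels
        rr = []
        for p in mutant_preds:
            kill_pos = np.where(correct & (p != orig_pred))[0]
            rr.append(1.0 / (kill_pos[0] + 1) if kill_pos.size else 0.0)
        return float(np.mean(rr))

A smaller subset that still kills the same mutants keeps the mutation score intact while moving the first killing inputs toward the front of the run, which is the sense in which BSS improves MRR.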

The main contributions of this paper are summarized as follows:

  • We put forward Boundary Sample Selection (BSS), the first DL mutation testing technique that selects a subset of the whole test set to implement mutation testing, which can optimize the process by reducing labeling and running cost and improving mutant sensitivity.

  • We conduct an empirical study with 14 popular mutation operators and nearly 5000 mutants of DL models to check the performance of BSS. The results of our experiments show that BSS is very useful in mutation testing, i.e., the subsets generated by BSS are much smaller, more sensitive, almost completely representative, and more efficient than the whole test set.

The rest of our paper is organized as follows. The background and related work are presented in Section 2. In Section 3, we define the boundary sample and describe Boundary Sample Selection in detail. The experimental settings are introduced in Section 4, including studied datasets, subject models, mutation operators, research questions, and so on. The experimental results and findings are explained in Section 5. Sections 6 and 7 further discuss some important experimental details and threats to validity of our study, respectively. Section 8 concludes our paper.

Section snippets

Background and related work

In this section, we elaborate on the basic structure of deep neural networks and summarize related studies along three threads: deep neural networks, mutation testing, and DL testing.

Our approach

Before describing Boundary Sample Selection (BSS), we state our problem more formally:

Problem. Given a trained original DL model M_o, its set S_Mm of mutant models S_Mm = {M_m1, M_m2, ..., M_mk}, and an available test set T of test data samples, instead of manually labeling and running all samples of T on S_Mm, we aim to select a subset T* such that labeling and running only the samples of T* yields a result that can replace the result of the whole set T.
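Read as an explicit selection criterion, with d and ε as symbols we introduce for illustration (the precise distance is defined later in this section), the problem can be written as:

    T^{*} = \{\, x \in T : d(x, \partial M_o) \le \varepsilon \,\}, \qquad |T^{*}| \ll |T|,

where d(x, ∂M_o) denotes the distance of sample x to the decision boundary of the original model M_o and ε is the selection threshold.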

In this section, we will describe the idea and the details of our BSS technique to

Experiment setup

In this section, we present the experiment setup for conducting mutation testing on DL models, which aims to evaluate the performance of the proposed BSS. Our experiment includes the following steps: the generation of original models from datasets (Section 4.1) and DL model structures (Section 4.2), the generation of mutants from mutation operators (Section 4.3), and the experimental evaluation (Section 4.4). To overcome the randomness of DL models, we have repeated the experiments five times.

Experiment results

In this section, we specify our research questions, along with our motivation, approach, results, and findings. To overcome the randomness of DL models, we have repeated the experiments five times. All the experimental results below are composite results of five repetitions. In each RQ, we will introduce how to summarize the results of these repeated experiments in detail to eliminate the randomness.

On the whole, the performance evaluation parts of our experiments (RQ2, RQ3, and RQ4) are

Discussion

In this section, to better understand our proposed BSS, we further discuss four aspects: the performance under different thresholds, the sensitivity of the whole test sets, the dissection of boundary samples, and the benefits of BSS.

Threats to validity

In this section, we discuss the threats to validity of our study from two aspects: external threats and internal threats.

Conclusion

For traditional software, mutation testing is a methodology to systematically check the quality of test datasets. Recently, many researchers have introduced mutation testing into DL testing by designing specific mutation operators for DL models. However, mutation testing for DL models is still at an early stage. After analyzing prior studies, we find three hindering factors: oracle workload, running cost, and mutant sensitivity.

We put forward Boundary Sample Selection

CRediT authorship contribution statement

Weijun Shen: Methodology, Writing - original draft, Software. Yanhui Li: Conceptualization, Methodology, Writing - original draft. Yuanlei Han: Software. Lin Chen: Methodology, Writing - review & editing. Di Wu: Software. Yuming Zhou: Methodology, Writing - review & editing. Baowen Xu: Methodology.

Declaration of Competing Interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "Boundary Sampling to Boost Mutation Testing for Deep Learning Models".

Acknowledgements

The work is supported by the National Key R&D Program of China (Grant No. 2018YFB1003901) and the National Natural Science Foundation of China (Grant Nos. 61932012, 61872177, 61832009, 61772263, and 61772259). We thank the anonymous referees for their helpful comments on this paper.

References (84)

  • L. Deng et al., Mutation operators for testing Android apps, Inf. Softw. Technol. (2017)
  • B. Aziz, Towards a mutation analysis of IoT protocols, Inf. Softw. Technol. (2018)
  • M. Wen et al., Exposing library API misuses via mutation analysis, Proceedings of the 41st International Conference on Software Engineering (2019)
  • F. Wu et al., Memory mutation testing, Inf. Softw. Technol. (2017)
  • M. Gligoric et al., Selective mutation testing for concurrent code, Proceedings of the 2013 International Symposium on Software Testing and Analysis (2013)
  • D. Gong et al., Mutant reduction based on dominance relation for weak mutation testing, Inf. Softw. Technol. (2017)
  • Y. Jia et al., Constructing subtle faults using higher order mutation testing, 2008 Eighth IEEE International Working Conference on Source Code Analysis and Manipulation (2008)
  • M. Sahinoglu et al., A Bayes sequential statistical procedure for approving software products, Proceedings of the IFIP Conference on Approving Software Products (ASP90) (1990)
  • Y.-S. Ma et al., MuJava: an automated class mutation system, Software Testing, Verification and Reliability (2005)
  • O. Banias, Test case selection-prioritization approach based on memoization dynamic programming algorithm, Inf. Softw. Technol. (2019)
  • A. Arrieta et al., Pareto efficient multi-objective black-box test case selection for simulation-based testing, Inf. Softw. Technol. (2019)
  • V. Garousi et al., Multi-objective regression test selection in practice: an empirical study in the defense software industry, Inf. Softw. Technol. (2018)
  • M. Bures et al., Employment of multiple algorithms for optimal path-based test selection strategy, Inf. Softw. Technol. (2019)
  • J. Hamidzadeh et al., IRAHC: instance reduction algorithm using hyperrectangle clustering, Pattern Recognit. (2015)
  • J.M. Zhang et al., Machine learning testing: survey, landscapes and horizons, IEEE Trans. Softw. Eng. (2020)
  • Y. LeCun et al., Deep learning, Nature (2015)
  • A. Geiger, Are we ready for autonomous driving? The KITTI vision benchmark suite, IEEE Conference on Computer Vision and Pattern Recognition (2012)
  • O. Abdel-Hamid et al., Convolutional neural networks for speech recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing (2014)
  • C.C. Dan et al., Flexible, high performance convolutional neural networks for image classification, International Joint Conference on Artificial Intelligence (IJCAI) (2011)
  • I. Sutskever et al., Sequence to sequence learning with neural networks, Advances in Neural Information Processing Systems (2014)
  • E. Choi et al., Using recurrent neural network models for early detection of heart failure onset, Journal of the American Medical Informatics Association (JAMIA) (2016)
  • A. Taeihagh and H.S.M. Lim, Governing autonomous vehicles: emerging responses for safety, liability, privacy, ...
  • P. Ammann et al., Introduction to Software Testing (2016)
  • C. Szegedy et al., Intriguing properties of neural networks, arXiv preprint arXiv:1312.6199 (2013)
  • K. Pei et al., DeepXplore: automated whitebox testing of deep learning systems, Proceedings of the 26th Symposium on Operating Systems Principles (2017)
  • L. Ma et al., DeepCT: tomographic combinatorial testing for deep learning systems, 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER) (2019)
  • X. Yuan et al., Adversarial examples: attacks and defenses for deep learning, IEEE Trans. Neural Netw. Learn. Syst. (2019)
  • Y. Sun et al., Concolic testing for deep neural networks, Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (2018)
  • L. Ma et al., DeepMutation: mutation testing of deep learning systems, 2018 IEEE 29th International Symposium on Software Reliability Engineering (ISSRE) (2018)
  • J. Zhang et al., Predictive mutation testing, IEEE Trans. Softw. Eng. (2018)
  • J.H. Andrews et al., Is mutation an appropriate tool for testing experiments?, Proceedings of the 27th International Conference on Software Engineering (ICSE 2005) (2005)
  • T.T. Chekam et al., An empirical study on mutation, statement and branch coverage fault revelation that avoids the unreliable clean program assumption, 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE) (2017)