The ratio of equivalent mutants: A key to analyzing mutation equivalence

https://doi.org/10.1016/j.jss.2021.111039

Highlights

  • The REM of a program can be estimated by evaluating its level of redundancy.

  • The REM of a program can be inferred from the mutation operators that are used.

  • The REM of a program can be used to estimate the number of equivalent mutants.

  • The REM of a program can be used to estimate the number of distinct mutants.

  • The REM of a program can be used to generate a minimal set of mutants.

  • The REM of a program can be used to define an equivalence-aware mutation score.

Abstract

Mutation testing is the art of generating syntactic versions (called mutants) of a base program, and is widely used in software testing, most notably for the assessment of test suites. Mutants are useful only to the extent that they are semantically distinct from the base program, but some may well be semantically equivalent to it despite being syntactically distinct. Much research has been devoted to identifying, and weeding out, equivalent mutants, but determining whether two programs are semantically equivalent is a non-trivial, tedious, error-prone task. Yet in practice it is not necessary to identify equivalent mutants individually; for most intents and purposes, it suffices to estimate their number. In this paper, we are interested in estimating, for a given number of mutants generated from a program, the ratio of those that are equivalent to the base program; we refer to this as the Ratio of Equivalent Mutants (REM, for short). We argue, on analytical grounds, that the REM of a program can be estimated from a static analysis of the program, and that it can be used to analyze many mutation-related properties of a program. The purpose of this paper is to draw attention to this potentially cost-effective approach to a longstanding, stubborn problem.

Section snippets

Mutant equivalence

Mutation is the art of generating syntactic variations of a program P, and is meaningful only to the extent that the syntactic modifications applied to P yield semantic differences; but in practice a mutant M may be syntactically distinct from the base program P yet still compute the exact same function as P. The existence, and pervasiveness, of equivalent mutants is a source of bias and uncertainty in mutation-based analysis:

  • If we generate 100 mutants of program P and find that some test suite

Redundancy: the mutants’ elixir of immortality

Yao et al. (2014) ask the question: what are the causes of mutant equivalence? Mutant equivalence is determined by two factors, namely the mutant operators and the program being mutated. For the sake of argument, we consider a fixed mutation policy (defined by a set of mutant operators) and we reformulate Yao’s question as: what attribute of a program makes it prone to generate equivalent mutants? A program is prone to generate equivalent mutants if it continues to deliver the

Estimating redundancy metrics

In this section we discuss how we automate the derivation of the redundancy metrics. Among the five redundancy metrics, four (SRI, SRF, FR, NI) pertain to the base program and one (ND) pertains to the oracle that we use to determine equivalence. We resolve to build the regression model using only the four program-specific metrics, then to factor in non-determinacy by means of the following formula: REM = ρ(SRI, SRF, NI, FR) + ND × (1 − ρ(SRI, SRF, NI, FR)), where ρ(SRI, SRF, NI, FR) is the regression
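The combination of the program-level regression estimate with the oracle's non-determinacy can be sketched as follows; note that the regression coefficients used for ρ below are hypothetical placeholders for illustration, not the paper's fitted model:

```python
def rem_estimate(rho_value, nd):
    """Combine the regression estimate with oracle non-determinacy.

    Implements REM = rho + ND * (1 - rho): non-determinacy of the
    oracle inflates the apparent ratio of equivalent mutants.
    rho_value: regression estimate rho(SRI, SRF, NI, FR), in [0, 1]
    nd: non-determinacy metric of the equivalence oracle, in [0, 1]
    """
    return rho_value + nd * (1.0 - rho_value)


def rho(sri, srf, ni, fr, coeffs=(0.05, 0.3, 0.3, 0.2, 0.1)):
    """Linear regression sketch over the four program-specific metrics.

    The coefficients are made-up placeholders, NOT the fitted model
    reported in the paper; they only illustrate the shape of the model.
    """
    b0, b1, b2, b3, b4 = coeffs
    value = b0 + b1 * sri + b2 * srf + b3 * ni + b4 * fr
    return min(max(value, 0.0), 1.0)  # clamp to a valid ratio


# A deterministic oracle (ND = 0) leaves the estimate at rho; a fully
# non-deterministic one (ND = 1) pushes REM to 1.
print(rem_estimate(rho(0.4, 0.5, 0.3, 0.2), nd=0.1))
```

Note how the formula interpolates between ρ (when ND = 0) and 1 (when ND = 1), reflecting that a non-deterministic oracle can make distinct mutants look equivalent.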

Estimating the ratio of equivalent mutants

In order to test the validity of our conjecture that redundancy metrics enable us to predict the REM of a program, we conduct an experiment:

  • We consider a set of functions taken from the Apache Commons Mathematics Library and the Apache Commons Lang3 Library (http://apache.org/), a benchmark of software components commonly used in software testing experiments.

  • Each function comes with a test data file, which includes not only the test data proper, but also a test oracle that compares the output

Quantifying mutation redundancy

We consider a program P whose ratio of equivalent mutants is REM, we consider a test suite T, and we let N be the number of mutants that T kills. We cannot tell how good T is unless we know how many distinct mutants the set of killed mutants contains. What really measures the effectiveness of T is not N, but rather the number of equivalence classes of the set of killed mutants; the question that we must address, then, is: how do we estimate the number of equivalence classes in a set of N
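One way to make this question concrete is a small simulation. The model below, in which each new mutant independently falls into an already-seen equivalence class with probability REM, is our illustrative assumption, not the paper's derivation:

```python
import random


def expected_classes(n_mutants, rem, trials=2000, seed=0):
    """Monte Carlo sketch: expected number of equivalence classes
    among n_mutants mutants.

    Modeling assumption (ours, for illustration only): each mutant
    after the first is equivalent to some already-seen mutant with
    probability rem; otherwise it opens a new class.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        classes = 0
        for i in range(n_mutants):
            if i == 0 or rng.random() >= rem:
                classes += 1
        total += classes
    return total / trials


# Even a modest REM shrinks the number of distinct mutants noticeably:
# the mean is close to 1 + 99 * (1 - 0.2) = 80.2, well below 100.
print(expected_classes(100, 0.2))
```

Under this toy model the number of distinct mutants is roughly N × (1 − REM), which already conveys the key point of this section: N overstates the information content of a killed-mutant set.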

Mutation score

When we run M mutants of program P on some test suite T and we find that X mutants are killed, it is customary to view the ratio MS(M, X) = X/M as a measure of the effectiveness of T, called the mutation score of T. We argue that this formula suffers from two major flaws:

  • The denominator ought to reflect the fact that some of the M mutants are equivalent to P, hence no test data can kill them.

  • Both the numerator and the denominator ought to be quantified not in terms of the number of mutants, but instead
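A minimal sketch of the correction implied by the first flaw, under our reading (discounting the expected M × REM equivalent mutants from the denominator; this is an illustration, not necessarily the paper's exact formula):

```python
def naive_mutation_score(m, x):
    """Customary score: killed mutants over all generated mutants."""
    return x / m


def equivalence_aware_score(m, x, rem):
    """Sketch of an REM-adjusted score (our illustrative reading).

    An expected m * rem of the mutants are equivalent to the base
    program, so no test data can kill them; only the remaining
    m * (1 - rem) mutants should appear in the denominator.
    """
    killable = m * (1.0 - rem)
    return min(x / killable, 1.0)  # cap at 1 in case x exceeds the estimate


# With 100 mutants, 72 killed, and REM = 0.1, the naive score
# understates effectiveness.
print(naive_mutation_score(100, 72))          # 0.72
print(equivalence_aware_score(100, 72, 0.1))  # 0.8
```

The second flaw (counting equivalence classes rather than mutants) would further adjust both numerator and denominator, but requires the class-counting estimate discussed in the previous section.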

Extraction of minimal set

If we learn anything from Section 5, it is that the number of distinct mutants in a set of size N can be much smaller than N, even for very small values of REM. This raises the question: how can we identify a minimal set of distinct mutants that includes one representative from each equivalence class (to be as good as the whole set) and no more than one representative from each equivalence class (for the sake of minimality)? In other words, given a set of size N partitioned by an
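The extraction just described can be sketched as a greedy pass, assuming a pairwise equivalence oracle `equivalent(a, b)` (a hypothetical helper; in practice such an oracle can only be approximated):

```python
def minimal_set(mutants, equivalent):
    """Greedy sketch: keep exactly one representative per class.

    `equivalent(a, b)` is an assumed oracle deciding semantic
    equivalence of two mutants. Each mutant is kept only if it is
    not equivalent to any representative selected so far.
    """
    representatives = []
    for m in mutants:
        if not any(equivalent(m, r) for r in representatives):
            representatives.append(m)
    return representatives


# Toy example: mutants tagged by the function they compute.
mutants = ["m1:f", "m2:g", "m3:f", "m4:h", "m5:g"]


def same_semantics(a, b):
    return a.split(":")[1] == b.split(":")[1]


print(minimal_set(mutants, same_semantics))  # ['m1:f', 'm2:g', 'm4:h']
```

The result contains one representative per equivalence class and no more, which is exactly the two requirements stated above; the cost is O(N × k) oracle calls, where k is the number of classes.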

Impact of mutant generation policy

So far, we have analyzed the REM of a program by focusing solely on the program, assuming a fixed mutant generation policy; but the REM also depends on the mutant generation policy, specifically, on the set of mutant operators that we deploy. We see two possible approaches to integrating the mutant generation policy with the analysis of the program’s attributes:

  • Either select some special mutant generation policies, such as those that are implemented in common tools (Coles, 2017, Ma and Offutt,

Assessment

In this section we present an assessment of our approach by showing a use case of how we envision computing and using the REM of a program (Section 9.1); we then discuss threats to the validity of our REM-based approach (Section 9.2).

Conclusion and prospects

Our goal in this paper is to draw researchers’ attention to an avenue of mutation testing research that has not been explored much so far, yet promises to yield useful results at low cost.

CRediT authorship contribution statement

Imen Marsit: The first to discover the correlation between redundancy metrics and REM; responsible for much of the data collection, compilation, and analysis, including the data in Sections 3 and 8. Amani Ayad: Responsible for developing the system for calculating redundancy metrics and for deriving the regression model using the redundancy metrics as calculated by her tool. David Kim: Tested, in an undergraduate directed study course, the hypothesis that all mutants have the same REM. Monsour

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is partly supported by NSF, United States, grant number DGE1565478. The authors are very grateful to the anonymous reviewers for their valuable and insightful feedback, which has greatly helped us improve the paper and enrich its content.

Dr. Imen Marsit is an assistant professor of computer science at the University of Sousse, Tunisia.

References (67)

  • Morell, L.J., et al., 1993. A framework for defining semantic metrics. J. Syst. Softw.
  • Papadakis, M., et al., 2014. Mitigating the effects of equivalent mutants with mutant classification strategies. Sci. Comput. Program.
  • Voas, J.M., et al., 1993. Semantic metrics for software testability. J. Syst. Softw.
  • Adamopoulos, K., Harman, M., Hierons, R., 2004. How to overcome the equivalent mutant problem and achieve tailored...
  • Abran, A., 2012. Software Metrics and Software Metrology.
  • Ammann, P., Delamaro, M., Offutt, J., 2014. Establishing theoretical minimal sets of mutants. In: Proceedings, ICST...
  • Andrews, J., Briand, L., Labiche, Y., 2005. Is mutation an appropriate tool for testing experiments?. In: Proceedings,...
  • Androutsopoulos, K., Clark, D., Dan, H., Hierons, R.M., Harman, M., 2014. An analysis of the relationship between...
  • Avizienis, A., et al., 2004. Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Secure Comput.
  • Ayad, A., 2019. Quantitative Metrics for Mutation Testing. Technical Report.
  • Ayad, A., Marsit, I., Loh, J., Omri, M.N., Mili, A., 2019a. Estimating the number of equivalent mutants. In:...
  • Ayad, A., et al. Quantitative metrics for mutation testing.
  • Ayad, A., et al. Using semantic metrics to predict mutation equivalence.
  • Siemens Suite, 2007. Technical Report.
  • Boehm, B., 1981. Software Engineering Economics.
  • Boehm, B.W., et al., 2000. Software Cost Estimation with COCOMO II.
  • Boehme, M., 2018. STADS: Software testing as species discovery. ACM TOSEM.
  • Budd, T.A., et al., 1982. Two notions of correctness and their relation to testing. Acta Inform.
  • Budd, T.A., DeMillo, R.A., Lipton, R.J., Sayward, F., 1980. Theoretical and empirical studies on using program mutation...
  • Carvalho, L., Guimarães, M., Fernandes, L., Hajjaji, M.A., Gheyi, R., Thüm, T., 2018. Equivalent mutants in...
  • Chao, A., et al., 2014. Species Richness: Estimation and Comparison.
  • Chekam, T.T., Papadakis, M., Bissyandé, T.F., Le Traon, Y., Sen, K., 2020. Selecting fault revealing mutants. In:...
  • Chidamber, S.R., et al., 1994. A metrics suite for object oriented design. IEEE TSE.
  • Clark, D., et al., 2012. Squeeziness: An information theoretic measure for avoiding fault masking. Inform. Process. Lett.
  • Clark, D., et al., 2019. Normalized squeeziness and failed error propagation. Inform. Process. Lett.
  • Coles, H., 2017. Real World Mutation Testing. Technical Report.
  • Csiszar, I., et al., 2011. Information Theory: Coding Theorems for Discrete Memoryless Systems.
  • Diallo, N., et al. Correctness and relative correctness.
  • Fenton, N., 1991. Software Metrics: A Rigorous Approach.
  • Fenton, N.E., et al., 1997. Software Metrics: A Rigorous and Practical Approach.
  • Gopinath, R., Alipour, A., Ahmed, I., Jensen, C., Groce, A., 2016. Measuring effectiveness of mutant sets. In:...
  • Grano, G., et al., 2019. Lightweight assessment of test case effectiveness using source code quality indicators. IEEE Trans. Softw. Eng.
  • Gruen, B., et al. The impact of equivalent mutants.


    Dr. Amani Ayad is an assistant professor of computer science at SUNY Farmingdale in New York, USA.

    David Kim is an Honors undergraduate student at NJIT, Newark, NJ USA.

    Monsour Latif is a graduate student at NJIT, Newark, NJ USA.

    Dr. JiMeng Loh is an associate professor of mathematics at NJIT, Newark NJ USA.

    Dr. Mohamed Nazih Omri is a professor of computer science at the University of Sousse, Tunisia.

    Dr. Ali Mili is a professor of computer science at NJIT, Newark NJ USA.

    Editor: W. Eric Wong.
