Abstract
Formal verification has advanced to the point that developers can verify the correctness of small, critical modules. Unfortunately, despite considerable effort, determining whether a “verification” verifies what its author intends remains difficult. Previous approaches are hard to understand and often limited in applicability. Developers need verification coverage expressed in terms of the software they are verifying, not model-checking diagnostics. We propose a methodology, and tools to support it, that allow developers to determine (and correct) what it is that they have verified. Our basic approach is a novel variation of mutation analysis combined with the idea of verification driven by falsification. We use the CBMC model checker to show that this approach applies not only to simple data structures, sorting routines, and a routine in Mozilla’s JavaScript engine, but also to understanding an ongoing effort to verify the Linux kernel’s read-copy-update (RCU) mechanism. Moreover, we show that despite the probabilistic nature of random testing and the inherent incompleteness of testing as opposed to verification, the same techniques, with suitable modifications, apply to automated test generation as well as to formal verification. In essence, it is the number of surviving mutants that drives the scalability of our methods, not the underlying method for detecting faults in a program. From the point of view of a Popperian analysis, where an unkilled mutant is a weakness (in terms of its falsifiability) in a “scientific theory” of program behavior, it is only the number of weaknesses to be examined by a user that matters.
Notes
By a harness we mean a program that defines an environment and the form of valid tests, and provides correctness properties.
In fact, the actual code is incorrect, with an access a[i] that does not properly use short-circuiting logical operators to protect against out-of-bounds indices; CBMC detected this, and we fixed it for this paper.
In our own practice, the most common way of setting it is to guess a bound and see if the resulting problem is too large for the available resources.
There is one noted exception in Sect. 4.4.
We show the output of the print statements, not the full CBMC trace: this is what a developer will examine first.
In fact, if we choose a val to check before we assign to ref, we could completely dispense with storing ref at all.
See the issues labeled with TSTL on the pyfakefs GitHub issue tracker for a history of the testing effort.
Our terminology here is not quite Popper’s, which is somewhat difficult to follow without a lengthy introduction to his classification of statements.
In fact, Kaner, Bach, and Pettichord explicitly mention Popper, though only in the context of using tests to refute conjectures about the correctness of software, not in the context of attempting to refute the testing effort itself.
Note that we use a model checking approach that already guarantees non-spurious counterexamples, and provides bounded rather than full verification.
References
Ahmed, I., Gopinath, R., Brindescu, C., Groce, A., Jensen, C.: Can testedness be effectively measured? In: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, pp. 547–558. ACM, New York, NY, USA (2016). https://doi.org/10.1145/2950290.2950324
Ahmed, I., Jensen, C., Groce, A., McKenney, P.E.: Applying mutation analysis on kernel test suites: an experience report. In: International Workshop on Mutation Analysis, pp. 110–115 (2017)
Aichernig, B.K.: Model-based mutation testing of reactive systems. In: Theories of Programming and Formal Methods, pp. 23–36. Springer (2013)
Alipour, M.A., Groce, A., Zhang, C., Sanadaji, A., Caushik, G.: Finding model-checkable needles in large source code haystacks: Modular bug-finding via static analysis and dynamic invariant discovery. In: International Workshop on Constraints in Formal Verification (2013)
Andrews, J.H., Briand, L.C., Labiche, Y.: Is mutation an appropriate tool for testing experiments? In: International Conference on Software Engineering, pp. 402–411 (2005)
Andrews, J.H., Groce, A., Weston, M., Xu, R.G.: Random test run length and effectiveness. In: Automated Software Engineering, pp. 19–28 (2008)
Andrews, J.H., Briand, L.C., Labiche, Y., Namin, A.S.: Using mutation analysis for assessing and comparing testing coverage criteria. IEEE Trans. Softw. Eng. 32(8), 608 (2006)
Arcuri, A., Briand, L.: A hitchhiker’s guide to statistical tests for assessing randomized algorithms in software engineering. Softw. Test. Verif. Reliab. 24(3), 219–250 (2014)
Auerbach, G., Copty, F., Paruthi, V.: Formal verification of arbiters using property strengthening and underapproximations. In: Formal Methods in Computer-Aided Design, pp. 21–24 (2010)
Ball, T., Kupferman, O., Yorsh, G.: Abstraction for falsification. In: Computer Aided Verification, pp. 67–81 (2005)
Barr, E.T., Harman, M., McMinn, P., Shahbaz, M., Yoo, S.: The oracle problem in software testing: a survey. IEEE Trans. Softw. Eng. 41(5), 507–525 (2015)
Bentley, J.: Programming pearls: writing correct programs. Commun. ACM 26(12), 1040–1045 (1983)
Black, P.E., Okun, V., Yesha, Y.: Mutation of model checker specifications for test generation and evaluation. In: Mutation 2000, pp. 14–20 (2000)
Bloch, J.: Extra, extra - read all about it: nearly all binary searches and mergesorts are broken. http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html (2006)
bremen, mrbean, jmcgeheeiv, et al.: pyfakefs implements a fake file system that mocks the Python file system modules. https://github.com/jmcgeheeiv/pyfakefs (2011)
Budd, T.A., DeMillo, R.A., Lipton, R.J., Sayward, F.G.: Theoretical and empirical studies on using program mutation to test the functional correctness of programs. In: Principles of Programming Languages, pp. 220–233. ACM (1980)
Budd, T.A.: Mutation analysis of program test data. Ph.D. thesis, Yale University, New Haven, CT, USA (1980)
Budd, T.A., Lipton, R.J., DeMillo, R.A., Sayward, F.G.: Mutation Analysis. Yale University, Department of Computer Science, New Haven (1979)
Buxton, J.N., Randell, B.: Report of a conference sponsored by the NATO science committee. In: NATO Software Engineering Conference, vol. 1969 (1969)
Chen, Y., Groce, A., Zhang, C., Wong, W.K., Fern, X., Eide, E., Regehr, J.: Taming compiler fuzzers. In: ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 197–208 (2013)
Chockler, H., Gurfinkel, A., Strichman, O.: Beyond vacuity: Towards the strongest passing formula. In: Proceedings of the 2008 International Conference on Formal Methods in Computer-Aided Design, pp. 24:1–24:8 (2008)
Chockler, H., Kupferman, O., Kurshan, R.P., Vardi, M.Y.: A practical approach to coverage in model checking. In: Computer Aided Verification, pp. 66–78 (2001)
Chockler, H., Kroening, D., Purandare, M.: Computing mutation coverage in interpolation-based model checking. IEEE Trans. CAD Integr. Circuits Syst. 31(5), 765–778 (2012)
Clarke, E., Grumberg, O., McMillan, K., Zhao, X.: Efficient generation of counterexamples and witnesses in symbolic model checking. In: Design Automation Conference, pp. 427–432 (1995)
Clarke, E.M., Grumberg, O., Peled, D.: Model Checking. MIT Press, Cambridge (2000)
Cuoq, P., Monate, B., Pacalet, A., Prevosto, V., Regehr, J., Yakobowski, B., Yang, X.: Testing static analyzers with randomly generated programs. In: NASA Formal Methods Symposium, pp. 120–125 (2012)
Daran, M., Thévenod-Fosse, P.: Software error analysis: A real case study involving real faults and mutations. In: ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 158–171. ACM (1996)
de Moura, L.M., Bjørner, N.: Z3: an efficient SMT solver. In: Tools and Algorithms for the Construction and Analysis of Systems, pp. 337–340 (2008)
DeMillo, R.A., Lipton, R.J., Sayward, F.G.: Hints on test data selection: help for the practicing programmer. Computer 11(4), 34–41 (1978)
Desnoyers, M.: [RFC git tree] userspace RCU (urcu) for Linux (2009). http://urcu.so
Desnoyers, M., McKenney, P.E., Stern, A., Dagenais, M.R., Walpole, J.: User-level implementations of read-copy update. IEEE Trans. Parallel Distrib. Syst. 23, 375–382 (2012). https://doi.org/10.1109/TPDS.2011.159
Dijkstra, E.W.: A Discipline of Programming. Prentice-Hall, Englewood Cliffs, NJ (1976)
Eén, N., Sörensson, N.: An extensible SAT-solver. In: Symposium on the Theory and Applications of Satisfiability Testing (SAT), pp. 502–518 (2003)
Ghassabani, E., Gacek, A., Whalen, M.W., Heimdahl, M.P.E., Wagner, L.G.: Proof-based coverage metrics for formal verification. In: IEEE/ACM International Conference on Automated Software Engineering, pp. 194–199 (2017a)
Ghassabani, E., Gacek, A., Whalen, M.W.: Efficient generation of inductive validity cores for safety properties. In: ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp. 314–325 (2016)
Ghassabani, E., Whalen, M.W., Gacek, A.: Efficient generation of all minimal inductive validity cores. In: FMCAD, pp. 31–38 (2017b)
Gligoric, M., Groce, A., Zhang, C., Sharma, R., Alipour, A., Marinov, D.: Comparing non-adequate test suites using coverage criteria. In: International Symposium on Software Testing and Analysis, pp. 302–313 (2013)
Gligoric, M., Groce, A., Zhang, C., Sharma, R., Alipour, A., Marinov, D.: Guidelines for coverage-based comparisons of non-adequate test suites. ACM Trans. Softw. Eng. Methodol. (accepted for publication)
Gopinath, R., Jensen, C., Groce, A.: Code coverage for suite evaluation by developers. In: International Conference on Software Engineering, pp. 72–82 (2014)
Groce, A., Ahmed, I., Jensen, C., McKenney, P.E.: How verified is my code? falsification-driven verification. In: 30th IEEE/ACM International Conference on Automated Software Engineering, ASE 2015, Lincoln, NE, USA, November 9–13, 2015, pp. 737–748 (2015). https://doi.org/10.1109/ASE.2015.40
Groce, A., Alipour, M.A., Gopinath, R.: Coverage and its discontents. In: Onward! Essays, pp. 255–268 (2014)
Groce, A., Erwig, M.: Finding common ground: choose, assert, and assume. In: Workshop on Dynamic Analysis, pp. 12–17 (2012)
Groce, A., Holmes, J., Marinov, D., Shi, A., Zhang, L.: An extensible, regular-expression-based tool for multi-language mutant generation. In: International Conference on Software Engineering (2018)
Groce, A., Holzmann, G., Joshi, R., Xu, R.G.: Putting flight software through the paces with testing, model checking, and constraint-solving. In: Workshop on Constraints in Formal Verification, pp. 1–15 (2008)
Groce, A., Holzmann, G., Joshi, R.: Randomized differential testing as a prelude to formal verification. In: International Conference on Software Engineering, pp. 621–631 (2007)
Groce, A., Joshi, R.: Exploiting traces in program analysis. In: Tools and Algorithms for the Construction and Analysis of Systems, pp. 379–393 (2006)
Groce, A., Joshi, R.: Random testing and model checking: Building a common framework for nondeterministic exploration. In: Workshop on Dynamic Analysis, pp. 22–28 (2008)
Groce, A., Pinto, J., Azimi, P., Mittal, P., Holmes, J., Kellar, K.: TSTL: the template scripting testing language. https://github.com/agroce/tstl (2015b)
Groce, A., Pinto, J., Azimi, P., Mittal, P.: TSTL: a language and tool for testing (demo). In: ACM International Symposium on Software Testing and Analysis, pp. 414–417 (2015a)
Groce, A., Pinto, J.: A little language for testing. In: NASA Formal Methods Symposium, pp. 204–218 (2015)
Groce, A.: Error explanation with distance metrics. In: Tools and Algorithms for the Construction and Analysis of Systems, pp. 108–122 (2004)
Groce, A.: Quickly testing the tester via path coverage. In: Workshop on Dynamic Analysis (2009)
Groce, A., Kroening, D.: Making the most of BMC counter examples. Electron. Notes Theor. Comput. Sci. 119(2), 67–81 (2005)
Guniguntala, D., McKenney, P.E., Triplett, J., Walpole, J.: The read-copy-update mechanism for supporting real-time applications on shared-memory multiprocessor systems with Linux. IBM Syst. J. 47(2), 221–236 (2008)
Holmes, J., Groce, A., Pinto, J., Mittal, P., Azimi, P., Kellar, K., O’Brien, J.: TSTL: the template scripting testing language. Int. J. Softw. Tools Technol. Transf. (2016). https://doi.org/10.1007/s10009-016-0445-y. (Online first)
Horspool, R.N.: Practical fast searching in strings. Softw. Pract. Exp. 10(6), 501–506 (1980)
Hoskote, Y., Kam, T., Ho, P.H., Zhao, X.: Coverage estimation for symbolic model checking. In: ACM/IEEE Design Automation Conference, pp. 300–305 (1999)
Hume, D.: An Enquiry Concerning Human Understanding. Routledge, London (1748)
Just, R., Jalali, D., Inozemtseva, L., Ernst, M.D., Holmes, R., Fraser, G.: Are mutants a valid substitute for real faults in software testing? In: ACM SIGSOFT Symposium on Foundations of Software Engineering, pp. 654–665 (2014)
Kaner, C., Bach, J., Pettichord, B.: Lessons Learned in Software Testing: A Context-Driven Approach. Wiley, Hoboken (2001)
Kroening, D., Clarke, E.M., Lerda, F.: A tool for checking ANSI-C programs. In: Tools and Algorithms for the Construction and Analysis of Systems, pp. 168–176 (2004)
Kroening, D., Strichman, O.: Efficient computation of recurrence diameters. In: Verification, Model Checking, and Abstract Interpretation, pp. 298–309 (2003)
Kupferman, O., Li, W., Seshia, S.: A theory of mutations with applications to vacuity, coverage, and fault tolerance. In: Formal Methods in Computer-Aided Design, pp. 1–9 (2008)
Lakatos, I.: The role of crucial experiments in science. Stud. His. Philos. Sci. Part A 4(4), 309–325 (1974)
Lawlor, R.: quicksort.c. http://www.comp.dit.ie/rlawlor/Alg_DS/sorting/quickSort.c. Accessed April 20, 2015
Lee, T.C., Hsiung, P.A.: Mutation coverage estimation for model checking. In: Automated Technology for Verification and Analysis, pp. 354–368 (2004)
Lipton, R.J.: Fault diagnosis of computer programs. Technical Report, Carnegie Mellon University (1971)
Liu, X.: muupi mutation tool. https://github.com/apepkuss/muupi (2016)
Mathur, A.P.: Foundations of Software Testing. Addison-Wesley, Boston (2012)
Mathur, A.P., Wong, W.E.: An empirical comparison of data flow and mutation-based test adequacy criteria. J. Softw. Test. Verif. Reliab. 4(1), 9–31 (1994)
McKeeman, W.: Differential testing for software. Dig. Tech. J. Dig. Equip. Corp. 10(1), 100–107 (1998)
McKenney, P.E., Eggemann, D., Randhawa, R.: Improving energy efficiency on asymmetric multiprocessing systems (2013). https://www.usenix.org/system/files/hotpar13-poster8-mckenney.pdf
McKenney, P.E., Slingwine, J.D.: Read-copy update: Using execution history to solve concurrency problems. In: Parallel and Distributed Computing and Systems, pp. 509–518. Las Vegas, NV (1998)
McKenney, P.E.: RCU Linux usage (2006). Available: http://www.rdrop.com/users/paulmck/RCU/linuxusage.html (Viewed January 14, 2007)
McKenney, P.E.: RCU torture test operation. https://www.kernel.org/doc/Documentation/RCU/torture.txt
McKenney, P.E.: Re: [PATCH fyi] RCU: the bloatwatch edition (2009). Available: http://lkml.org/lkml/2009/1/14/449 (Viewed January 15, 2009)
McKenney, P.E.: Verification challenge 4: Tiny RCU. http://paulmck.livejournal.com/39343.html (2015a)
McKenney, P.E.: Verification challenge 5: Uses of RCU. http://paulmck.livejournal.com/39793.html (2015b)
McKenney, P.E.: Structured deferral: synchronization via procrastination. Commun. ACM 56(7), 40–49 (2013). https://doi.org/10.1145/2483852.2483867
Murugesan, A., Whalen, M.W., Rungta, N., Tkachuk, O., Person, S., Heimdahl, M.P.E., You, D.: Are we there yet? determining the adequacy of formalized requirements and test suites. In: NASA Formal Methods Symposium, pp. 279–294 (2015)
Offutt, A.J., Voas, J.M.: Subsumption of condition coverage techniques by mutation testing. In: Technical Report ISSE-TR-96-01, Information and Software Systems Engineering, George Mason University (1996)
Papadakis, M., Jia, Y., Harman, M., Traon, Y.L.: Trivial compiler equivalence: A large scale empirical study of a simple fast and effective equivalent mutant detection technique. In: International Conference on Software Engineering (2015)
Popper, K.: The Logic of Scientific Discovery. Hutchinson, London (1959)
Popper, K.: Conjectures and Refutations: The Growth of Scientific Knowledge. Routledge, London (1963)
Schuler, D., Zeller, A.: Assessing oracle quality with checked coverage. In: International Conference on Software Testing, Verification and Validation, pp. 90–99 (2011)
Schuler, D., Zeller, A.: Checked coverage: an indicator for oracle quality. Softw. Test. Verif. Reliab. 23(7), 531–551 (2013)
scvalex: Finding all paths of minimum length to a node using dijkstras algorithm. https://compprog.wordpress.com/2008/01/17/finding-all-paths-of-minimum-length-to-a-node-using-dijkstras-algorithm/ (2008)
Stout, R.: If Death Ever Slept. Viking (1957)
Strichman, O., Godlin, B.: Regression verification: a practical way to verify programs. In: Verified Software: Theories, Tools, Experiments, pp. 496–501 (2008)
Tassarotti, J., Dreyer, D., Vafeiadis, V.: Verifying read-copy-update in a logic for weak memory. In: ACM SIGPLAN Conference on Programming Language Design and Implementation (2015). (To appear)
Tip, F.: A survey of program slicing techniques. J. Program. Lang. 3, 121–189 (1995)
visar: [SOLVED] doubly linked list insertion sort in C. http://www.linuxquestions.org/questions/programming-9/doubly-linked-list-insertion-sort-in-c-4175415860/ (2012)
Visser, W., Havelund, K., Brat, G., Park, S., Lerda, F.: Model checking programs. Autom. Softw. Eng. 10(2), 203–232 (2003)
Yang, X., Chen, Y., Eide, E., Regehr, J.: Finding and understanding bugs in C compilers. In: ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 283–294 (2011)
Zhang, X., Gupta, R., Zhang, Y.: Precise dynamic slicing algorithms. In: International Conference on Software Engineering, pp. 319–329 (2003)
Acknowledgements
A portion of this work was funded by NSF Grants CCF-1217824 and CCF-1054786.
Cite this article
Groce, A., Ahmed, I., Jensen, C. et al. How verified (or tested) is my code? Falsification-driven verification and testing. Autom Softw Eng 25, 917–960 (2018). https://doi.org/10.1007/s10515-018-0240-y