Do code review measures explain the incidence of post-release defects?

Case study replications and Bayesian networks

Published in Empirical Software Engineering

Abstract

Aim

In contrast to studies of defects found during code review, we aim to clarify whether code review measures can explain the prevalence of post-release defects.

Method

We replicate McIntosh et al.’s (Empirical Softw. Engg. 21(5): 2146–2189, 2016) study that uses additive regression to model the relationship between defects and code reviews. To increase external validity, we apply the same methodology to a new software project. We discuss our findings with the first author of the original study, McIntosh. We then investigate how to reduce the impact of correlated predictors in the variable selection process and how to increase understanding of the inter-relationships among the predictors by employing Bayesian Network (BN) models.
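For concreteness, the sketch below illustrates this modeling step in R, the language suggested by the packages cited in the references (rms, Hmisc); the measure names and data are synthetic placeholders rather than the study's actual measures or scripts.

    # A minimal sketch of the additive-regression step, assuming the rms and
    # Hmisc packages; measures and data are synthetic illustrations only.
    library(rms)    # ols(), rcs(), datadist()
    library(Hmisc)  # varclus() for clustering correlated predictors

    set.seed(1)
    n <- 500
    d <- data.frame(
      size           = rexp(n, 1/500),   # module size in lines of code (hypothetical)
      prior_defects  = rpois(n, 2),      # defects fixed before the release
      authors        = rpois(n, 3) + 1,  # distinct authors touching the module
      review_discuss = runif(n)          # share of changes with review discussion
    )
    d$post_defects <- rpois(n, lambda = exp(-2 + 0.001 * d$size + 0.3 * d$prior_defects))

    dd <- datadist(d); options(datadist = "dd")

    # Additive model; a restricted cubic spline lets the skewed size measure
    # enter the model nonlinearly.
    fit <- ols(log1p(post_defects) ~ rcs(log1p(size), 3) + prior_defects +
                 authors + review_discuss, data = d, x = TRUE, y = TRUE)
    anova(fit)  # per-predictor contribution to the fit

    # Cluster correlated predictors before any variable selection is attempted.
    plot(varclus(~ log1p(size) + prior_defects + authors + review_discuss, data = d))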

Context

We use the same measures the authors obtained for the Qt project in the original study. We also mine data from the version control system and issue tracker of Google Chrome and operationalize measures that are close analogs to the large collection of code, process, and code review measures used in the replicated study.
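As one illustration of how such measures could be operationalized, the sketch below derives two simple per-file process measures from an exported commit log; the file name and column names are hypothetical, and the study's actual mining pipeline is in the replication package.

    # Sketch: per-file process measures from an exported commit log.
    # The CSV name and columns (sha, author, date, file) are hypothetical.
    commits <- read.csv("chrome_commit_log.csv", stringsAsFactors = FALSE)

    # Number of prior changes per file: distinct commits touching the file.
    prior_changes <- aggregate(sha ~ file, data = commits,
                               FUN = function(x) length(unique(x)))
    names(prior_changes)[2] <- "prior_changes"

    # Authorship: distinct developers who changed the file.
    n_authors <- aggregate(author ~ file, data = commits,
                           FUN = function(x) length(unique(x)))
    names(n_authors)[2] <- "n_authors"

    measures <- merge(prior_changes, n_authors, by = "file")
    head(measures)  # one row per file with prior_changes and n_authors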

Results

Both the data from the original study and the Chrome data showed high instability in the influence of code review measures on defects, with the results being highly sensitive to the variable selection procedure. Models without code review predictors fit as well as or better than those with review predictors. The replication, however, agrees with the bulk of prior work in showing that prior defects, module size, and authorship have the strongest relationships to post-release defects. The application of BN models helped explain the observed instability by demonstrating that the review-related predictors do not affect post-release defects directly but instead have indirect effects. For example, changes that have no review discussion tend to be associated with files that have had many prior defects, which in turn increase the number of post-release defects. We hope that similar analyses of other software engineering techniques may also yield a more nuanced view of their impact. Our replication package, including our data and scripts, is publicly available (Krutauz et al. 2020).
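The sketch below illustrates, on synthetic data, the kind of Bayesian Network analysis that exposes such indirect effects, using the bnlearn R package; the variables are hypothetical stand-ins for the paper's measures and the learned network is not the study's model.

    # Sketch of a BN analysis with bnlearn on synthetic, discretized variables.
    library(bnlearn)

    set.seed(2)
    n <- 2000
    no_discussion <- factor(sample(c("yes", "no"), n, TRUE, prob = c(0.3, 0.7)))
    prior_defects <- factor(ifelse(runif(n) < ifelse(no_discussion == "yes", 0.6, 0.3),
                                   "many", "few"))
    post_defects  <- factor(ifelse(runif(n) < ifelse(prior_defects == "many", 0.5, 0.2),
                                   "high", "low"))
    d <- data.frame(no_discussion, prior_defects, post_defects)

    dag    <- hc(d)          # hill-climbing structure learning
    fitted <- bn.fit(dag, d) # conditional probability tables for each node

    # Evidence about unreviewed changes propagates through prior defects:
    # compare P(high post-release defects) under the two evidence settings.
    cpquery(fitted, event = (post_defects == "high"), evidence = (no_discussion == "yes"))
    cpquery(fitted, event = (post_defects == "high"), evidence = (no_discussion == "no"))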

Notes

  1. Machine learning methods focused on maximising prediction performance are widely used for defect prediction, but such methods are typically not transparent enough to test scientific hypotheses (Lin et al. 2013).

  2. https://scitools.com/

  3. https://www.chromium.org/developers/calendar

  4. https://cran.r-project.org/web/packages/Hmisc/Hmisc.pdf

  5. A generative model specifies a joint probability distribution over all observed variables, whereas a discriminative model provides a model only for the target variable(s) conditional on the predictor variables. Thus, while a discriminative model allows only sampling of the target variables conditional on the predictors, a generative model can be used, for example, to simulate (i.e. generate) values of any variable in the model. Generative models are therefore essential for gaining an understanding of the underlying mechanics of a system (see the sketch after this list).
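A minimal sketch of this distinction, assuming the bnlearn R package and synthetic data with hypothetical variable names:

    # Generative vs. discriminative, in miniature.
    library(bnlearn)

    set.seed(3)
    n <- 1000
    reviewed <- factor(sample(c("yes", "no"), n, TRUE))
    post_defects <- factor(ifelse(runif(n) < ifelse(reviewed == "yes", 0.2, 0.5),
                                  "high", "low"), levels = c("low", "high"))
    d <- data.frame(reviewed, post_defects)

    # Generative: a fitted BN defines a joint distribution over all variables,
    # so we can simulate complete rows, predictors included.
    gen <- bn.fit(hc(d), d)
    rbn(gen, 5)  # five simulated (reviewed, post_defects) rows

    # Discriminative: a logistic regression models only P(post_defects = "high" | reviewed)
    # and cannot generate values of 'reviewed' itself.
    disc <- glm(post_defects ~ reviewed, data = d, family = binomial)
    predict(disc, newdata = data.frame(reviewed = "yes"), type = "response")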

References

  • Almqvist JPF (2006) Replication of controlled experiments in empirical software engineering-a survey http://lup.lub.lu.se/student-papers/record/1330459

  • Arcuri A, Briand L (2011) A practical guide for using statistical tests to assess randomized algorithms in software engineering. In: 2011 33rd International Conference on Software Engineering (ICSE), pages 1–10. IEEE

  • Arisholm E, Briand LC (2006) Predicting fault-prone components in a java legacy system. In: International Symposium on Empirical Software Engineering, pp 8–17

  • Austin P, Tu J (2004) Automated variable selection methods for logistic regression result in unstable models for predicting AMI mortality. Journal of Clinical Epidemiology 57:1138–1146

  • Axelrod R (1997) Advancing the art of simulation in the social sciences. In: Simulating social phenomena, pages 21–40. Springer

  • Bacchelli A, Bird C (2013) Expectations, outcomes, and challenges of modern code review. In: Proceedings of the International Conference on Software Engineering, pages 712–721. IEEE Press

  • Bai CG (2005) Bayesian network based software reliability prediction with an operational profile. J Syst Softw 77(2):103–112

  • Beller M, Bacchelli A, Zaidman A, Juergens E (2014) Modern code reviews in open-source projects: Which problems do they fix?. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR, pages 202–211, New York, NY, USA. ACM

  • Bibi S, Stamelos I, Angelis L (2003) Bayesian belief networks as a software productivity estimation tool. In: 1st Balkan Conference in Informatics, Thessaloniki, Greece

  • Bird C, Nagappan N, Murphy B, Gall H, Devanbu P (2011) Don’t touch my code!: examining the effects of ownership on software quality. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pages 4–14. ACM

  • Bosu A, Carver JC, Bird C, Orbeck J, Chockley C (2017) Process aspects and social dynamics of contemporary code review: Insights from open source development and industrial practice at microsoft. IEEE Trans Softw Eng 43(1):56–75

  • Bosu A, Carver JC, Hafiz M, Hilley P, Janni D (2014) Identifying the characteristics of vulnerable code changes: An empirical study. In: Proceedings of the 22Nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 257–268, New York, NY, USA, ACM

  • Bosu A, Greiler M, Bird C (2015) Characteristics of useful code reviews: an empirical study at microsoft. In: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pp 146–156

  • Camilo F, Meneely A, Nagappan M (2015) Do bugs foreshadow vulnerabilities? a study of the chromium project. In: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pp 269–279

  • Carvalho AM (2009) Scoring functions for learning Bayesian networks. INESC-ID Tec. Rep.

  • Carver JC (2010) Towards reporting guidelines for experimental replications: A proposal. In: 1st international workshop on replication in empirical software engineering, pages 2–5. Citeseer

  • Carver R (1978) The case against statistical significance testing. Harv Educ Rev 48(3):378–399

  • Chlebus BS, Nguyen SH (1998) On finding optimal discretizations for two attributes. In: International Conference on Rough Sets and Current Trends in Computing, pages 537–544. Springer

  • Dey T, Mockus A (2020) Deriving a usage-independent software quality metric. Empir Softw Eng 25(2):1596–1641

  • Eick SG, Loader CR, Long MD, Votta LG, Wiel SV (1992) Estimating software fault content before coding. In: Proceedings of the 14th International Conference on Software Engineering, pp 59–65

  • Fagan M (2002) A history of software inspections. In: Software pioneers, pages 562–573. Springer

  • Fagan ME (1976) Design and code inspections to reduce errors in program development. IBM Syst J 15(3):182–211

  • Fenton N, Krause P, Neil M (2002) Software measurement: Uncertainty and causal modeling. IEEE software 19(4):116–122

  • Fenton N, Neil M, Marsh W, Hearty P, Marquez D, Krause P, Mishra R (2007) Predicting software defects in varying development lifecycles using bayesian nets. Information and Software Technology 49(1):32–43

  • Fenton NE, Neil M (1999) A critique of software defect prediction models. IEEE Trans Softw Eng 25(5):675–689

  • Fleiss JL (1981) The measurement of interrater agreement. In: Statistical Methods for Rates and Proportions, 2nd edn, pp 212–236

  • Friedman N, Goldszmidt M, Wyner A (1999) Data analysis with bayesian networks: A bootstrap approach. In: Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 196–205. Morgan Kaufmann Publishers Inc.

  • Garcia S, Luengo J, Sáez JA, Lopez V, Herrera F (2013) A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering 25(4):734–750

  • Gómez OS, Juristo N, Vegas S (2014) Understanding replication of experiments in software engineering: a classification. Inf Softw Technol 56(8):1033–1048

  • Gousios G, Zaidman A, Storey M-A, van Deursen A (2015) Work practices and challenges in pull-based development: The integrator’s perspective. In: Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE ’15, pages 358–368, Piscataway, NJ, USA, IEEE Press

  • Graves TL, Karr AF, Marron JS, Siy H (2000) Predicting fault incidence using software change history. IEEE Transactions on Software Engineering 26(7):653–661

  • Harrell FE Jr (2013) rms: Regression Modeling Strategies. R package version 4.0-0

  • Hassan AE (2009) Predicting faults using the complexity of code changes. In: Proceedings of the 31st International Conference on Software Engineering, pages 78–88 IEEE Computer Society

  • Heckerman D (1998) A tutorial on learning with bayesian networks. In: Learning in graphical models, pages 301–354. Springer

  • Herbsleb JD, Mockus A (2003) An empirical study of speed and communication in globally-distributed software development. IEEE Trans Softw Eng 29 (6):481–494

  • Højsgaard S (2012) Graphical independence networks with the gRain package for R. J Stat Softw 46(10):1–26

  • Huang L, Boehm B (2006) How much software quality investment is enough: a value-based approach. IEEE software 23(5):88–95

  • James G, Witten D, Hastie T, Tibshirani R (2013) An Introduction to Statistical Learning, vol 112. Springer

  • Knight JC, Myers EA (1993) An improved inspection technique. ACM Communications 36(11):51–61

  • Kollanus S, Koskinen J (2009) Survey of software inspection research. Open Software Engineering Journal 3:15–34

  • Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. MIT Press

  • Kononenko O, Baysal O, Godfrey MW (2016) Code review quality: How developers see it. In: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pp 1028–1038

  • Kononenko O, Baysal O, Guerrouj L, Cao Y, Godfrey MW (2015) Investigating code review quality: Do people and participation matter?. In: Software Maintenance and Evolution (ICSME), 2015 IEEE International Conference on, pages 111–120. IEEE

  • Krutauz A, Dey T, Rigby PC, Mockus A (2020) Replication Package for Do Code Review Measures Explain the Incidence of Post-Release Defects? https://github.com/CESEL/ReviewPostReleaseDefectsReplication

  • Laitenberger O, DeBaud J (2000) An encompassing life cycle centric survey of software inspection. J Syst Softw 50(1):5–31

  • Landis JR, Koch GG (1977) An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 33:363–374

  • Lauritzen SL, Spiegelhalter DJ (1988) Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society: Series B (Methodological) 50(2):157–194

  • Lin M, Lucas HC Jr, Shmueli G (2013) Research commentary—too big to fail: large samples and the p-value problem. Inf Syst Res 24(4):906–917

  • Martin J, Tsai WT (1990) N-fold inspection: a requirements analysis technique. ACM Communications 33(2):225–232

  • McIntosh S, Kamei Y, Adams B, Hassan AE (2014) The impact of code review coverage and code review participation on software quality: A case study of the qt, vtk, and itk projects. In: Proceedings of the 11th Working Conference on Mining Software Repositories, pages 192–201 ACM

  • McIntosh S, Kamei Y, Adams B, Hassan AE (2016) An empirical study of the impact of modern code review practices on software quality. Empirical Softw. Engg. 21(5):2146–2189

  • Menzies T, Brady A, Keung J, Hihn J, Williams S, El-Rawas O, Green P, Boehm B (2013) Learning project management decisions: a case study with case-based reasoning versus data farming. IEEE Trans Softw Eng 39 (12):1698–1713

  • Mockus A (2010) Organizational volatility and its effects on software defects. In: ACM SIGSOFT/FSE, pages 117–126, Santa Fe, New Mexico, November 7–11

  • Mockus A (2014) Engineering big data solutions. In: ICSE’14 FOSE, pp 85–99

  • Mockus A, Fielding RT, Herbsleb J (2000) A case study of open source software development: the Apache server. In: Proceedings of the 22nd International Conference on Software Engineering, pages 263–272. ACM

  • Morales R, McIntosh S, Khomh F (2015) Do code review practices impact design quality? a case study of the qt, vtk, and itk projects. In: Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on, pages 171–180. IEEE

  • Mukadam M, Bird C, Rigby PC (2013) Gerrit software code review data from android. In: 2013 10th Working Conference on Mining Software Repositories (MSR), pp 45–48

  • Munaiah N, Camilo F, Wigham W, Meneely A, Nagappan M (2017) Do bugs foreshadow vulnerabilities? an in-depth study of the chromium project. Empir Softw Eng 22(3):1305–1347

  • Nagappan N, Murphy B, Basili VR (2008) The influence of organizational structure on software quality: an empirical case study. In: ICSE, 2008, pp 521–530

  • Neil M, Fenton N (1996) Predicting software quality using bayesian belief networks. In: Proceedings of the 21st Annual Software Engineering Workshop, pages 217–230 NASA Goddard Space Flight Centre

  • Neuhaus S, Zimmermann T, Holler C, Zeller A (2007) Predicting vulnerable software components. In: Proceedings of the 14th ACM Conference on Computer and Communications Security, pages 529–540. ACM

  • Okutan A, Yıldız OT (2014) Software defect prediction using bayesian networks. Empir Softw Eng 19(1):154–181

  • Pai GJ, Dugan JB (2007) Empirical analysis of software fault content and fault proneness using bayesian methods. IEEE Transactions on software Engineering 33(10):675–686

  • Pearl J (2014) Probabilistic reasoning in intelligent systems: networks of plausible inference

  • Pendharkar PC, Subramanian GH, Rodger JA (2005) A probabilistic model for predicting software development effort. IEEE Transactions on software engineering 31(7):615–624

  • Perez A, Larranaga P, Inza I (2006) Supervised classification with conditional Gaussian networks: increasing the structure complexity from naive Bayes. International Journal of Approximate Reasoning 43(1):1–25

  • Perry D, Porter A, Wade M, Votta L, Perpich J (2002) Reducing inspection interval in large-scale software development. Software Engineering, IEEE Transactions on 28(7):695–705

  • Pinheiro J, Bates D, DebRoy S, Sarkar D, R Development Core Team (2011) nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-97. R Foundation for Statistical Computing, Vienna

  • Porter A, Siy H, Mockus A, Votta L (1998) Understanding the sources of variation in software inspections. ACM Transactions Software Engineering Methodology 7(1):41–79

  • Rahman MM, Roy CK (2014) An insight into the pull requests of GitHub. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR 2014, pages 364–367, New York, NY, USA. ACM

  • Rahman MM, Roy CK, Kula RG (2017) Predicting usefulness of code review comments using textual features and developer experience. In: 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), pp 215–226

  • Rigby P, Cleary B, Painchaud F, Storey M-A, German D (2012) Contemporary peer review in action: Lessons from open source development. IEEE software 29(6):56–61

  • Rigby PC, Bird C (2013) Convergent contemporary software peer review practices. In: Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pages 202–212 ACM

  • Rigby PC, German DM, Cowen L, Storey MA (2014) Peer review on Open-Source software projects: parameters, statistical models, and theory. ACM Transactions on Software Engineering and Methodology 23 (4):35:1–35:33

  • Rigby PC, German DM, Storey M-A (2008) Open Source Software Peer Review Practices: A Case Study of the Apache Server. In: ICSE ’08: Proceedings of the 30th International Conference on Software engineering, pages 541–550, New York, NY, USA, ACM

  • Rigby PC, Storey MA (2011) Understanding broadcast based peer review on open source software projects. In: Proceedings of the 33rd International Conference on Software Engineering, pages 541–550. ACM

  • Runeson P, Höst M (2009) Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering 14(2):131–164

  • RTI (2002) The Economic Impacts of Inadequate Infrastructure for Software Testing. National Institute of Standards and Technology, USA

  • Sauer C, Jeffery DR, Land L, Yetton P (2000) The effectiveness of software development technical reviews: a behaviorally motivated program of research. IEEE Transactions on Software Engineering 26(1):1–14

  • Scutari M (2013) Learning Bayesian networks in R, an example in systems biology. http://www.bnlearn.com/about/slides/slides-useRconf13.pdf

  • Shivaji S, Whitehead EJ, Akella R, Kim S (2013) Reducing features to improve code change-based bug prediction. IEEE Trans Softw Eng 39 (4):552–569

  • Shmueli G (2010) To explain or to predict? Statistical Science 25(3):289–310

  • Shull F, Basili V, Carver J, Maldonado JC, Travassos GH, Mendonça M, Fabbri S (2002) Replicating software engineering experiments: addressing the tacit knowledge problem. In: Proceedings of the 2002 International Symposium on Empirical Software Engineering, pages 7–16. IEEE

  • Shull FJ, Carver JC, Vegas S, Juristo N (2008) The role of replications in empirical software engineering. Empirical software engineering 13 (2):211–218

  • Sjoberg DI, Yamashita A, Anda B, Mockus A, Dyba T (2013) Quantifying the effect of code smells on maintenance effort. IEEE Trans Softw Eng 39 (8):1144–1156

  • Sober E (2002) Instrumentalism, parsimony, and the akaike framework. Philos Sci 69(S3):S112–S123

  • Stamelos I, Angelis L, Dimou P, Sakellaris E (2003) On the use of bayesian belief networks for the prediction of software productivity. Inf Softw Technol 45(1):51–60

  • Thongtanunam P, McIntosh S, Hassan AE, Iida H (2016) Revisiting code ownership and its relationship with software quality in the scope of modern code review. In: Proceedings of the 38th International Conference on Software Engineering, ICSE ’16, pages 1039–1050, New York, NY, USA, ACM

  • Van Koten C, Gray A (2006) An application of bayesian network for predicting object-oriented software maintainability. Inf Softw Technol 48(1):59–67

  • Votta LG (1993) Does every inspection need a meeting? SIGSOFT Softw Eng Notes 18(5):107–114

  • Wiegers KE (2001) Peer Reviews in Software: A Practical Guide. Addison-Wesley Information Technology Series. Addison-Wesley

  • Yu Y, Wang H, Filkov V, Devanbu P, Vasilescu B (2015) Wait for it: Determinants of pull request evaluation latency on github. In: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pp 367–371

  • Zheng Q, Mockus A, Zhou M (2015) A method to identify and correct problematic software activity data: Exploiting capacity constraints and data redundancies. In: ESEC/FSE’15, pages 637–648, Bergamo, Italy, ACM

Author information

Corresponding author

Correspondence to Peter C. Rigby.

Additional information

Communicated by: Tim Menzies

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Krutauz, A., Dey, T., Rigby, P.C. et al. Do code review measures explain the incidence of post-release defects? Empir Software Eng 25, 3323–3356 (2020). https://doi.org/10.1007/s10664-020-09837-4
