Do code review measures explain the incidence of post-release defects?

Case study replications and Bayesian networks

Published in Empirical Software Engineering

Abstract

Aim

In contrast to studies of defects found during code review, we aim to clarify whether code review measures can explain the prevalence of post-release defects.

Method

We replicate McIntosh et al.’s (Empirical Softw. Engg. 21(5): 2146–2189, 2016) study that uses additive regression to model the relationship between defects and code reviews. To increase external validity, we apply the same methodology to a new software project. We discuss our findings with the first author of the original study, McIntosh. We then investigate how to reduce the impact of correlated predictors in the variable selection process and how to increase understanding of the inter-relationships among the predictors by employing Bayesian Network (BN) models.
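For concreteness, the sketch below illustrates this modeling step in R, the language suggested by the packages cited in the references (rms, Hmisc); the measure names and data are synthetic placeholders rather than the study's actual measures or scripts.

    # A minimal sketch of the additive-regression step, assuming the rms and
    # Hmisc packages; measures and data are synthetic illustrations only.
    library(rms)    # ols(), rcs(), datadist()
    library(Hmisc)  # varclus() for clustering correlated predictors

    set.seed(1)
    n <- 500
    d <- data.frame(
      size           = rexp(n, 1/500),   # module size in lines of code (hypothetical)
      prior_defects  = rpois(n, 2),      # defects fixed before the release
      authors        = rpois(n, 3) + 1,  # distinct authors touching the module
      review_discuss = runif(n)          # share of changes with review discussion
    )
    d$post_defects <- rpois(n, lambda = exp(-2 + 0.001 * d$size + 0.3 * d$prior_defects))

    dd <- datadist(d); options(datadist = "dd")

    # Additive model; a restricted cubic spline lets the skewed size measure
    # enter the model nonlinearly.
    fit <- ols(log1p(post_defects) ~ rcs(log1p(size), 3) + prior_defects +
                 authors + review_discuss, data = d, x = TRUE, y = TRUE)
    anova(fit)  # per-predictor contribution to the fit

    # Cluster correlated predictors before any variable selection is attempted.
    plot(varclus(~ log1p(size) + prior_defects + authors + review_discuss, data = d))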

Context

We use the same measures the authors obtained for the Qt project in the original study. We also mine data from the version control system and issue tracker of Google Chrome and operationalize measures that are close analogs to the large collection of code, process, and code review measures used in the replicated study.
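As one illustration of how such measures could be operationalized, the sketch below derives two simple per-file process measures from an exported commit log; the file name and column names are hypothetical, and the study's actual mining pipeline is in the replication package.

    # Sketch: per-file process measures from an exported commit log.
    # The CSV name and columns (sha, author, date, file) are hypothetical.
    commits <- read.csv("chrome_commit_log.csv", stringsAsFactors = FALSE)

    # Number of prior changes per file: distinct commits touching the file.
    prior_changes <- aggregate(sha ~ file, data = commits,
                               FUN = function(x) length(unique(x)))
    names(prior_changes)[2] <- "prior_changes"

    # Authorship: distinct developers who changed the file.
    n_authors <- aggregate(author ~ file, data = commits,
                           FUN = function(x) length(unique(x)))
    names(n_authors)[2] <- "n_authors"

    measures <- merge(prior_changes, n_authors, by = "file")
    head(measures)  # one row per file with prior_changes and n_authors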

Results

Both the data from the original study and the Chrome data showed high instability in the influence of code review measures on defects, with the results being highly sensitive to the variable selection procedure. Models without code review predictors fit as well as or better than those with review predictors. The replication, however, agrees with the bulk of prior work in showing that prior defects, module size, and authorship have the strongest relationships to post-release defects. The application of BN models helped explain the observed instability by demonstrating that the review-related predictors do not affect post-release defects directly but instead have indirect effects. For example, changes that have no review discussion tend to be associated with files that have had many prior defects, which in turn increase the number of post-release defects. We hope that similar analyses of other software engineering techniques may also yield a more nuanced view of their impact. Our replication package, including our data and scripts, is publicly available (Krutauz et al. 2020).
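The sketch below illustrates, on synthetic data, the kind of Bayesian Network analysis that exposes such indirect effects, using the bnlearn R package; the variables are hypothetical stand-ins for the paper's measures and the learned network is not the study's model.

    # Sketch of a BN analysis with bnlearn on synthetic, discretized variables.
    library(bnlearn)

    set.seed(2)
    n <- 2000
    no_discussion <- factor(sample(c("yes", "no"), n, TRUE, prob = c(0.3, 0.7)))
    prior_defects <- factor(ifelse(runif(n) < ifelse(no_discussion == "yes", 0.6, 0.3),
                                   "many", "few"))
    post_defects  <- factor(ifelse(runif(n) < ifelse(prior_defects == "many", 0.5, 0.2),
                                   "high", "low"))
    d <- data.frame(no_discussion, prior_defects, post_defects)

    dag    <- hc(d)          # hill-climbing structure learning
    fitted <- bn.fit(dag, d) # conditional probability tables for each node

    # Evidence about unreviewed changes propagates through prior defects:
    # compare P(high post-release defects) under the two evidence settings.
    cpquery(fitted, event = (post_defects == "high"), evidence = (no_discussion == "yes"))
    cpquery(fitted, event = (post_defects == "high"), evidence = (no_discussion == "no"))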

Notes

  1. Machine learning methods focused on maximising prediction performance are widely used for defect prediction, but such methods are typically not transparent enough to test scientific hypotheses (Lin et al. 2013).

  2. https://scitools.com/

  3. https://www.chromium.org/developers/calendar

  4. https://cran.r-project.org/web/packages/Hmisc/Hmisc.pdf

  5. A generative model specifies a joint probability distribution over all observed variables, whereas a discriminative model provides a model only for the target variable(s) conditional on the predictor variables. Thus, while a discriminative model allows only sampling of the target variables conditional on the predictors, a generative model can be used, for example, to simulate (i.e. generate) values of any variable in the model. Generative models are therefore essential for gaining an understanding of the underlying mechanics of a system (see the sketch after this list).
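A minimal sketch of this distinction, assuming the bnlearn R package and synthetic data with hypothetical variable names:

    # Generative vs. discriminative, in miniature.
    library(bnlearn)

    set.seed(3)
    n <- 1000
    reviewed <- factor(sample(c("yes", "no"), n, TRUE))
    post_defects <- factor(ifelse(runif(n) < ifelse(reviewed == "yes", 0.2, 0.5),
                                  "high", "low"), levels = c("low", "high"))
    d <- data.frame(reviewed, post_defects)

    # Generative: a fitted BN defines a joint distribution over all variables,
    # so we can simulate complete rows, predictors included.
    gen <- bn.fit(hc(d), d)
    rbn(gen, 5)  # five simulated (reviewed, post_defects) rows

    # Discriminative: a logistic regression models only P(post_defects = "high" | reviewed)
    # and cannot generate values of 'reviewed' itself.
    disc <- glm(post_defects ~ reviewed, data = d, family = binomial)
    predict(disc, newdata = data.frame(reviewed = "yes"), type = "response")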

References

  • Almqvist JPF (2006) Replication of controlled experiments in empirical software engineering-a survey http://lup.lub.lu.se/student-papers/record/1330459

  • Arcuri A, Briand L (2011) A practical guide for using statistical tests to assess randomized algorithms in software engineering. In: 2011 33rd International Conference on Software Engineering (ICSE), pages 1–10. IEEE

  • Arisholm E, Briand LC (2006) Predicting fault-prone components in a java legacy system. In: International Symposium on Empirical Software Engineering, pp 8–17

  • Austin P, Tu J (2004) Automated variable selection methods for logistic regression result in unstable models for predicting AMI mortality. Journal of Clinical Epidemiology 57:1138–1146

  • Axelrod R (1997) Advancing the art of simulation in the social sciences. In: Simulating social phenomena, pages 21–40. Springer

  • Bacchelli A, Bird C (2013) Expectations, outcomes, and challenges of modern code review. In: Proceedings of the International Conference on Software Engineering, pages 712–721. IEEE Press

  • Bai CG (2005) Bayesian network based software reliability prediction with an operational profile. J Syst Softw 77(2):103–112

  • Beller M, Bacchelli A, Zaidman A, Juergens E (2014) Modern code reviews in open-source projects: Which problems do they fix?. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR, pages 202–211, New York, NY, USA. ACM

  • Bibi S, Stamelos I, Angelis L (2003) Bayesian belief networks as a software productivity estimation tool. In: 1st Balkan Conference in Informatics, Thessaloniki, Greece

  • Bird C, Nagappan N, Murphy B, Gall H, Devanbu P (2011) Don’t touch my code!: examining the effects of ownership on software quality. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pages 4–14. ACM

  • Bosu A, Carver JC, Bird C, Orbeck J, Chockley C (2017) Process aspects and social dynamics of contemporary code review: Insights from open source development and industrial practice at microsoft. IEEE Trans Softw Eng 43(1):56–75

  • Bosu A, Carver JC, Hafiz M, Hilley P, Janni D (2014) Identifying the characteristics of vulnerable code changes: An empirical study. In: Proceedings of the 22Nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 257–268, New York, NY, USA, ACM

  • Bosu A, Greiler M, Bird C (2015) Characteristics of useful code reviews: an empirical study at microsoft. In: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pp 146–156

  • Camilo F, Meneely A, Nagappan M (2015) Do bugs foreshadow vulnerabilities? a study of the chromium project. In: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pp 269–279

  • Carvalho AM (2009) Scoring functions for learning Bayesian networks. INESC-ID Tec. Rep.

  • Carver JC (2010) Towards reporting guidelines for experimental replications: A proposal. In: 1st international workshop on replication in empirical software engineering, pages 2–5. Citeseer

  • Carver R (1978) The case against statistical significance testing. Harv Educ Rev 48(3):378–399

  • Chlebus BS, Nguyen SH (1998) On finding optimal discretizations for two attributes. In: International Conference on Rough Sets and Current Trends in Computing, pages 537–544. Springer

  • Dey T, Mockus A (2020) Deriving a usage-independent software quality metric. Empir Softw Eng 25(2):1596–1641

  • Eick SG, Loader CR, Long MD, Votta LG, Wiel SV (1992) Estimating software fault content before coding. In: Proceedings of the 14th International Conference on Software Engineering, pp 59–65

  • Fagan M (2002) A history of software inspections. In: Software pioneers, pages 562–573. Springer

  • Fagan ME (1976) Design and code inspections to reduce errors in program development. IBM Syst J 15(3):182–211

  • Fenton N, Krause P, Neil M (2002) Software measurement: Uncertainty and causal modeling. IEEE software 19(4):116–122

  • Fenton N, Neil M, Marsh W, Hearty P, Marquez D, Krause P, Mishra R (2007) Predicting software defects in varying development lifecycles using bayesian nets. Information and Software Technology 49(1):32–43

  • Fenton NE, Neil M (1999) A critique of software defect prediction models. IEEE Trans Softw Eng 25(5):675–689

  • Fleiss JL (1981) The measurement of interrater agreement. In: Statistical Methods for Rates and Proportions, 2nd edn, pp 212–236

  • Friedman N, Goldszmidt M, Wyner A (1999) Data analysis with bayesian networks: A bootstrap approach. In: Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 196–205. Morgan Kaufmann Publishers Inc.

  • Garcia S, Luengo J, Sáez JA, Lopez V, Herrera F (2013) A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering 25(4):734–750

  • Gómez OS, Juristo N, Vegas S (2014) Understanding replication of experiments in software engineering: a classification. Inf Softw Technol 56(8):1033–1048

  • Gousios G, Zaidman A, Storey M-A, van Deursen A (2015) Work practices and challenges in pull-based development: The integrator’s perspective. In: Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE ’15, pages 358–368, Piscataway, NJ, USA, IEEE Press

  • Graves TL, Karr AF, Marron JS, Siy H (2000) Predicting fault incidence using software change history. IEEE Transactions on Software Engineering 26(7):653–661

  • Harrell FE Jr (2013) rms: Regression Modeling Strategies. R package version 4.0-0

  • Hassan AE (2009) Predicting faults using the complexity of code changes. In: Proceedings of the 31st International Conference on Software Engineering, pages 78–88 IEEE Computer Society

  • Heckerman D (1998) A tutorial on learning with bayesian networks. In: Learning in graphical models, pages 301–354. Springer

  • Herbsleb JD, Mockus A (2003) An empirical study of speed and communication in globally-distributed software development. IEEE Trans Softw Eng 29 (6):481–494

  • Højsgaard S (2012) Graphical independence networks with the gRain package for R. J Stat Softw 46(10):1–26

  • Huang L, Boehm B (2006) How much software quality investment is enough: a value-based approach. IEEE software 23(5):88–95

  • James G, Witten D, Hastie T, Tibshirani R (2013) An Introduction to Statistical Learning, vol 112. Springer

  • Knight JC, Myers EA (1993) An improved inspection technique. ACM Communications 36(11):51–61

  • Kollanus S, Koskinen J (2009) Survey of software inspection research. Open Software Engineering Journal 3:15–34

  • Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. MIT Press

  • Kononenko O, Baysal O, Godfrey MW (2016) Code review quality: How developers see it. In: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pp 1028–1038

  • Kononenko O, Baysal O, Guerrouj L, Cao Y, Godfrey MW (2015) Investigating code review quality: Do people and participation matter?. In: Software Maintenance and Evolution (ICSME), 2015 IEEE International Conference on, pages 111–120. IEEE

  • Krutauz A, Dey T, Rigby PC, Mockus A (2020) Replication Package for Do Code Review Measures Explain the Incidence of Post-Release Defects? https://github.com/CESEL/ReviewPostReleaseDefectsReplication

  • Laitenberger O, DeBaud J (2000) An encompassing life cycle centric survey of software inspection. J Syst Softw 50(1):5–31

  • Landis JR, Koch GG (1977) An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 33:363–374

  • Lauritzen SL, Spiegelhalter DJ (1988) Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society: Series B (Methodological) 50(2):157–194

  • Lin M, Lucas HC Jr, Shmueli G (2013) Research commentary—too big to fail: large samples and the p-value problem. Inf Syst Res 24(4):906–917

  • Martin J, Tsai WT (1990) N-fold inspection: a requirements analysis technique. ACM Communications 33(2):225–232

  • McIntosh S, Kamei Y, Adams B, Hassan AE (2014) The impact of code review coverage and code review participation on software quality: A case study of the qt, vtk, and itk projects. In: Proceedings of the 11th Working Conference on Mining Software Repositories, pages 192–201 ACM

  • McIntosh S, Kamei Y, Adams B, Hassan AE (2016) An empirical study of the impact of modern code review practices on software quality. Empirical Softw. Engg. 21(5):2146–2189

  • Menzies T, Brady A, Keung J, Hihn J, Williams S, El-Rawas O, Green P, Boehm B (2013) Learning project management decisions: a case study with case-based reasoning versus data farming. IEEE Trans Softw Eng 39 (12):1698–1713

  • Mockus A (2010) Organizational volatility and its effects on software defects. In: ACM SIGSOFT/FSE, pages 117–126, Santa Fe, New Mexico, November 7–11

  • Mockus A (2014) Engineering big data solutions. In: ICSE’14 FOSE, pp 85–99

  • Mockus A, Fielding RT, Herbsleb J (2000) A case study of open source software development: the Apache server. In: Proceedings of the 22nd International Conference on Software Engineering, pages 263–272. ACM

  • Morales R, McIntosh S, Khomh F (2015) Do code review practices impact design quality? a case study of the qt, vtk, and itk projects. In: Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on, pages 171–180. IEEE

  • Mukadam M, Bird C, Rigby PC (2013) Gerrit software code review data from android. In: 2013 10th Working Conference on Mining Software Repositories (MSR), pp 45–48

  • Munaiah N, Camilo F, Wigham W, Meneely A, Nagappan M (2017) Do bugs foreshadow vulnerabilities? an in-depth study of the chromium project. Empir Softw Eng 22(3):1305–1347

  • Nagappan N, Murphy B, Basili VR (2008) The influence of organizational structure on software quality: an empirical case study. In: ICSE, 2008, pp 521–530

  • Neil M, Fenton N (1996) Predicting software quality using bayesian belief networks. In: Proceedings of the 21st Annual Software Engineering Workshop, pages 217–230 NASA Goddard Space Flight Centre

  • Neuhaus S, Zimmermann T, Holler C, Zeller A (2007) Predicting vulnerable software components. In: Proceedings of the 14th ACM Conference on Computer and Communications Security, pages 529–540. ACM

  • Okutan A, Yıldız OT (2014) Software defect prediction using bayesian networks. Empir Softw Eng 19(1):154–181

  • Pai GJ, Dugan JB (2007) Empirical analysis of software fault content and fault proneness using bayesian methods. IEEE Transactions on software Engineering 33(10):675–686

  • Pearl J (2014) Probabilistic reasoning in intelligent systems: networks of plausible inference

  • Pendharkar PC, Subramanian GH, Rodger JA (2005) A probabilistic model for predicting software development effort. IEEE Transactions on software engineering 31(7):615–624

  • Perez A, Larranaga P, Inza I (2006) Supervised classification with conditional Gaussian networks: increasing the structure complexity from naive Bayes. International Journal of Approximate Reasoning 43(1):1–25

  • Perry D, Porter A, Wade M, Votta L, Perpich J (2002) Reducing inspection interval in large-scale software development. Software Engineering, IEEE Transactions on 28(7):695–705

  • Pinheiro J, Bates D, DebRoy S, Sarkar D, R Development Core Team (2011) nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-97. R Foundation for Statistical Computing, Vienna

  • Porter A, Siy H, Mockus A, Votta L (1998) Understanding the sources of variation in software inspections. ACM Transactions Software Engineering Methodology 7(1):41–79

  • Rahman MM, Roy CK (2014) An insight into the pull requests of GitHub. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR 2014, pages 364–367, New York, NY, USA. ACM

  • Rahman MM, Roy CK, Kula RG (2017) Predicting usefulness of code review comments using textual features and developer experience. In: 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), pp 215–226

  • Rigby P, Cleary B, Painchaud F, Storey M-A, German D (2012) Contemporary peer review in action: Lessons from open source development. IEEE software 29(6):56–61

  • Rigby PC, Bird C (2013) Convergent contemporary software peer review practices. In: Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pages 202–212 ACM

  • Rigby PC, German DM, Cowen L, Storey MA (2014) Peer review on Open-Source software projects: parameters, statistical models, and theory. ACM Transactions on Software Engineering and Methodology 23 (4):35:1–35:33

  • Rigby PC, German DM, Storey M-A (2008) Open Source Software Peer Review Practices: A Case Study of the Apache Server. In: ICSE ’08: Proceedings of the 30th International Conference on Software engineering, pages 541–550, New York, NY, USA, ACM

  • Rigby PC, Storey MA (2011) Understanding broadcast based peer review on open source software projects. In: Proceedings of the 33rd International Conference on Software Engineering, pages 541–550. ACM

  • Runeson P, Höst M (2009) Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering 14(2):131–164

  • RTI (2002) The Economic Impacts of Inadequate Infrastructure for Software Testing. National Institute of Standards and Technology, USA

  • Sauer C, Jeffery DR, Land L, Yetton P (2000) The effectiveness of software development technical reviews: a behaviorally motivated program of research. IEEE Transactions on Software Engineering 26(1):1–14

  • Scutari M (2013) Learning Bayesian networks in R, an example in systems biology. http://www.bnlearn.com/about/slides/slides-useRconf13.pdf

  • Shivaji S, Whitehead EJ, Akella R, Kim S (2013) Reducing features to improve code change-based bug prediction. IEEE Trans Softw Eng 39 (4):552–569

  • Shmueli G (2010) To explain or to predict? Statistical Science 25(3):289–310

  • Shull F, Basili V, Carver J, Maldonado JC, Travassos GH, Mendonça M, Fabbri S (2002) Replicating software engineering experiments: addressing the tacit knowledge problem. In: Proceedings of the 2002 International Symposium on Empirical Software Engineering, pages 7–16. IEEE

  • Shull FJ, Carver JC, Vegas S, Juristo N (2008) The role of replications in empirical software engineering. Empirical software engineering 13 (2):211–218

  • Sjoberg DI, Yamashita A, Anda B, Mockus A, Dyba T (2013) Quantifying the effect of code smells on maintenance effort. IEEE Trans Softw Eng 39 (8):1144–1156

  • Sober E (2002) Instrumentalism, parsimony, and the akaike framework. Philos Sci 69(S3):S112–S123

  • Stamelos I, Angelis L, Dimou P, Sakellaris E (2003) On the use of bayesian belief networks for the prediction of software productivity. Inf Softw Technol 45(1):51–60

  • Thongtanunam P, McIntosh S, Hassan AE, Iida H (2016) Revisiting code ownership and its relationship with software quality in the scope of modern code review. In: Proceedings of the 38th International Conference on Software Engineering, ICSE ’16, pages 1039–1050, New York, NY, USA, ACM

  • Van Koten C, Gray A (2006) An application of bayesian network for predicting object-oriented software maintainability. Inf Softw Technol 48(1):59–67

  • Votta LG (1993) Does every inspection need a meeting? SIGSOFT Softw Eng Notes 18(5):107–114

  • Wiegers KE (2001) Peer Reviews in Software: A Practical Guide. Addison-Wesley Information Technology Series. Addison-Wesley

  • Yu Y, Wang H, Filkov V, Devanbu P, Vasilescu B (2015) Wait for it: Determinants of pull request evaluation latency on github. In: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pp 367–371

  • Zheng Q, Mockus A, Zhou M (2015) A method to identify and correct problematic software activity data: Exploiting capacity constraints and data redundancies. In: ESEC/FSE’15, pages 637–648, Bergamo, Italy, ACM

Author information

Corresponding author

Correspondence to Peter C. Rigby.

Additional information

Communicated by: Tim Menzies

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Krutauz, A., Dey, T., Rigby, P.C. et al. Do code review measures explain the incidence of post-release defects? Empir Software Eng 25, 3323–3356 (2020). https://doi.org/10.1007/s10664-020-09837-4
