Abstract
Aim
In contrast to studies of defects found during code review, we aim to clarify whether code review measures can explain the prevalence of post-release defects.
Method
We replicate McIntosh et al.’s (Empirical Softw. Engg. 21(5): 2146–2189, 2016) study that uses additive regression to model the relationship between defects and code reviews. To increase external validity, we apply the same methodology on a new software project. We discuss our findings with the first author of the original study, McIntosh. We then investigate how to reduce the impact of correlated predictors in the variable selection process and how to increase understanding of the inter-relationships among the predictors by employing Bayesian Network (BN) models.
Context
As in the original study, we use the measures the original authors obtained for the Qt project. We mine data from the version control system and issue tracker of Google Chrome and operationalize measures that are close analogs to the large collection of code, process, and code review measures used in the replicated study.
Results
Both the data from the original study and the Chrome data showed high instability in the influence of code review measures on defects, with the results being highly sensitive to the variable selection procedure. Models without code review predictors had as good or better fit than those with review predictors. Our replication, however, agrees with the bulk of prior work in showing that prior defects, module size, and authorship have the strongest relationship to post-release defects. The application of BN models helped explain the observed instability by demonstrating that the review-related predictors do not affect post-release defects directly, but only indirectly. For example, changes that have no review discussion tend to be associated with files that have had many prior defects, which in turn increase the number of post-release defects. We hope that similar analyses of other software engineering techniques may also yield a more nuanced view of their impact. Our replication package, including our data and scripts, is publicly available (Krutauz et al. 2020).
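The indirect-effect chain described above (no review discussion, then many prior defects, then post-release defects) can be illustrated with a small simulation. This is a hypothetical sketch with made-up probabilities, not the paper's fitted BN model: `simulate_change` and all numeric values are assumptions chosen only to reproduce the qualitative pattern.

```python
import random

random.seed(1)

def simulate_change():
    """One simulated change in a hypothetical causal chain:
    no_discussion -> many_prior_defects -> post_release_defect.
    All probabilities are invented for illustration only."""
    no_discussion = random.random() < 0.3
    # Assumption: unreviewed changes are likelier to touch defect-prone files.
    p_prior = 0.6 if no_discussion else 0.2
    many_prior = random.random() < p_prior
    # Post-release defects depend only on prior defects, not on discussion directly.
    p_post = 0.5 if many_prior else 0.1
    post_defect = random.random() < p_post
    return no_discussion, many_prior, post_defect

data = [simulate_change() for _ in range(100000)]

def rate(rows):
    """Fraction of changes in `rows` that led to a post-release defect."""
    return sum(r[2] for r in rows) / len(rows)

# Marginally, lack of discussion predicts defects (an indirect effect)...
print(rate([r for r in data if r[0]]), rate([r for r in data if not r[0]]))
# ...but conditioning on prior defects removes the association (no direct effect).
print(rate([r for r in data if r[0] and r[1]]), rate([r for r in data if not r[0] and r[1]]))
```

Under these assumed probabilities the marginal defect rates differ (about 0.34 versus 0.18), yet within the stratum of files with many prior defects both groups sit near 0.5, mirroring how a BN can expose an association as indirect.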
Notes
Machine learning methods focused on maximising prediction performance are widely used for defect prediction, but such methods are typically not transparent enough to test scientific hypotheses (Lin et al. 2013).
A generative model specifies a joint probability distribution over all observed variables, whereas a discriminative model provides a model only for the target variable(s) conditional on the predictor variables. Thus, while a discriminative model allows only sampling of the target variables conditional on the predictors, a generative model can be used, for example, to simulate (i.e., generate) values of any variable in the model. Generative models are therefore essential for gaining an understanding of the underlying mechanics of a system.
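The distinction in this note can be sketched with a toy joint distribution over two binary variables. The variable names and probabilities below are hypothetical, chosen only to make the generative/discriminative contrast concrete:

```python
import random

random.seed(0)

# A *generative* model: a joint distribution P(discussion, prior_defects)
# that assigns a probability to every joint outcome.
joint = {
    (0, 0): 0.10,  # no discussion, few prior defects
    (0, 1): 0.30,  # no discussion, many prior defects
    (1, 0): 0.45,  # discussion,    few prior defects
    (1, 1): 0.15,  # discussion,    many prior defects
}

def sample_joint():
    """Generate one (discussion, defects) pair from the joint distribution."""
    r = random.random()
    acc = 0.0
    for outcome, p in joint.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # guard against floating-point round-off

def p_defects_given_discussion(d):
    """The *discriminative* view: P(defects = 1 | discussion = d).
    Derived here from the joint; a discriminative model would
    estimate only this conditional and could not sample `discussion`."""
    p_d = joint[(d, 0)] + joint[(d, 1)]
    return joint[(d, 1)] / p_d

samples = [sample_joint() for _ in range(10000)]
print(p_defects_given_discussion(0))  # close to 0.75
print(p_defects_given_discussion(1))  # close to 0.25
```

The generative model supports both tasks: simulating any variable and computing the conditional; the discriminative model supports only the latter, which is why generative models such as BNs lend themselves to explaining a system's mechanics.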
References
Almqvist JPF (2006) Replication of controlled experiments in empirical software engineering: a survey. http://lup.lub.lu.se/student-papers/record/1330459
Arcuri A, Briand L (2011) A practical guide for using statistical tests to assess randomized algorithms in software engineering. In: 2011 33rd International Conference on Software Engineering (ICSE), pages 1–10. IEEE
Arisholm E, Briand LC (2006) Predicting fault-prone components in a java legacy system. In: International Symposium on Empirical Software Engineering, pp 8–17
Austin P, Tu J (2004) Automated variable selection methods for logistic regression result in unstable models for predicting ami mortality. Journal of clinical epidemiology 57:1138–46, 12
Axelrod R (1997) Advancing the art of simulation in the social sciences. In: Simulating social phenomena, pages 21–40. Springer
Bacchelli A, Bird C (2013) Expectations, outcomes, and challenges of modern code review. In: Proceedings of the International Conference on Software Engineering, pages 712–721. IEEE Press
Bai CG (2005) Bayesian network based software reliability prediction with an operational profile. J Syst Softw 77(2):103–112
Beller M, Bacchelli A, Zaidman A, Juergens E (2014) Modern code reviews in open-source projects: Which problems do they fix?. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR, pages 202–211, New York, NY, USA. ACM
Bibi S, Stamelos I, Angelis L (2003) Bayesian belief networks as a software productivity estimation tool. In: 1st Balkan Conference in Informatics, Thessaloniki, Greece
Bird C, Nagappan N, Murphy B, Gall H, Devanbu P (2011) Don’t touch my code!: examining the effects of ownership on software quality. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pages 4–14. ACM
Bosu A, Carver JC, Bird C, Orbeck J, Chockley C (2017) Process aspects and social dynamics of contemporary code review: Insights from open source development and industrial practice at microsoft. IEEE Trans Softw Eng 43(1):56–75
Bosu A, Carver JC, Hafiz M, Hilley P, Janni D (2014) Identifying the characteristics of vulnerable code changes: An empirical study. In: Proceedings of the 22Nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 257–268, New York, NY, USA, ACM
Bosu A, Greiler M, Bird C (2015) Characteristics of useful code reviews: an empirical study at microsoft. In: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pp 146–156
Camilo F, Meneely A, Nagappan M (2015) Do bugs foreshadow vulnerabilities? a study of the chromium project. In: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pp 269–279
Carvalho AM (2009) Scoring functions for learning bayesian networks. INESC-ID Technical Report, p 12
Carver JC (2010) Towards reporting guidelines for experimental replications: A proposal. In: 1st international workshop on replication in empirical software engineering, pages 2–5. Citeseer
Carver R (1978) The case against statistical significance testing. Harv Educ Rev 48(3):378–399
Chlebus BS, Nguyen SH (1998) On finding optimal discretizations for two attributes. In: International Conference on Rough Sets and Current Trends in Computing, pages 537–544. Springer
Dey T, Mockus A (2020) Deriving a usage-independent software quality metric. Empir Softw Eng 25(2):1596–1641
Eick SG, Loader CR, Long MD, Votta LG, Wiel SV (1992) Estimating software fault content before coding. In: Proceedings of the 14th International Conference on Software Engineering, pp 59–65
Fagan M (2002) A history of software inspections. In: Software pioneers, pages 562–573. Springer
Fagan ME (1976) Design and code inspections to reduce errors in program development. IBM Syst J 15(3):182–211
Fenton N, Krause P, Neil M (2002) Software measurement: Uncertainty and causal modeling. IEEE software 19(4):116–122
Fenton N, Neil M, Marsh W, Hearty P, Marquez D, Krause P, Mishra R (2007) Predicting software defects in varying development lifecycles using bayesian nets. Information and Software Technology 49(1):32–43
Fenton NE, Neil M (1999) A critique of software defect prediction models. IEEE Trans Softw Eng 25(5):675–689
Fleiss JL (1981) The measurement of interrater agreement. Statistical methods for rates and proportions 2(212-236):22–23
Friedman N, Goldszmidt M, Wyner A (1999) Data analysis with bayesian networks: A bootstrap approach. In: Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 196–205. Morgan Kaufmann Publishers Inc.
Garcia S, Luengo J, Sáez JA, Lopez V, Herrera F (2013) A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering 25(4):734–750
Gómez OS, Juristo N, Vegas S (2014) Understanding replication of experiments in software engineering: a classification. Inf Softw Technol 56(8):1033–1048
Gousios G, Zaidman A, Storey M-A, van Deursen A (2015) Work practices and challenges in pull-based development: The integrator’s perspective. In: Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE ’15, pages 358–368, Piscataway, NJ, USA, IEEE Press
Graves TL, Karr AF, Marron JS, Siy H (2000) Predicting fault incidence using software change history. IEEE Transactions on Software Engineering 26(7):653–661
Harrell Jr FE (2013) rms: Regression modeling strategies. R package version 4.0-0
Hassan AE (2009) Predicting faults using the complexity of code changes. In: Proceedings of the 31st International Conference on Software Engineering, pages 78–88 IEEE Computer Society
Heckerman D (1998) A tutorial on learning with bayesian networks. In: Learning in graphical models, pages 301–354. Springer
Herbsleb JD, Mockus A (2003) An empirical study of speed and communication in globally-distributed software development. IEEE Trans Softw Eng 29 (6):481–494
Højsgaard S (2012) Graphical independence networks with the grain package for r. J Stat Softw 46(10):1–26
Huang L, Boehm B (2006) How much software quality investment is enough: a value-based approach. IEEE software 23(5):88–95
James G, Witten D, Hastie T, Tibshirani R (2013) An introduction to statistical learning, volume 112 Springer
Knight JC, Myers EA (1993) An improved inspection technique. Communications of the ACM 36(11):51–61
Kollanus S, Koskinen J (2009) Survey of software inspection research. Open Software Engineering Journal 3:15–34
Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques
Kononenko O, Baysal O, Godfrey MW (2016) Code review quality: How developers see it. In: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pp 1028–1038
Kononenko O, Baysal O, Guerrouj L, Cao Y, Godfrey MW (2015) Investigating code review quality: Do people and participation matter?. In: Software Maintenance and Evolution (ICSME), 2015 IEEE International Conference on, pages 111–120. IEEE
Krutauz A, Dey T, Rigby PC, Mockus A (2020) Replication Package for Do Code Review Measures Explain the Incidence of Post-Release Defects? https://github.com/CESEL/ReviewPostReleaseDefectsReplication
Laitenberger O, DeBaud J (2000) An encompassing life cycle centric survey of software inspection. J Syst Softw 50(1):5–31
Landis JR, Koch GG (1977) An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 33:363–374
Lauritzen SL, Spiegelhalter DJ (1988) Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society: Series B (Methodological) 50(2):157–194
Lin M, Lucas Jr HC, Shmueli G (2013) Research commentary: too big to fail: large samples and the p-value problem. Inf Syst Res 24(4):906–917
Martin J, Tsai WT (1990) N-fold inspection: a requirements analysis technique. Communications of the ACM 33(2):225–232
McIntosh S, Kamei Y, Adams B, Hassan AE (2014) The impact of code review coverage and code review participation on software quality: A case study of the qt, vtk, and itk projects. In: Proceedings of the 11th Working Conference on Mining Software Repositories, pages 192–201 ACM
McIntosh S, Kamei Y, Adams B, Hassan AE (2016) An empirical study of the impact of modern code review practices on software quality. Empirical Softw. Engg. 21(5):2146–2189
Menzies T, Brady A, Keung J, Hihn J, Williams S, El-Rawas O, Green P, Boehm B (2013) Learning project management decisions: a case study with case-based reasoning versus data farming. IEEE Trans Softw Eng 39 (12):1698–1713
Mockus A (2010) Organizational volatility and its effects on software defects. In: ACM SIGSOFT/FSE, pages 117–126, Santa Fe, New Mexico, November 7–11
Mockus A (2014) Engineering big data solutions. In: ICSE’14 FOSE, pp 85–99
Mockus A, Fielding RT, Herbsleb J (2000) A case study of open source software development: the apache server. In: Proceedings of the 22nd International Conference on Software Engineering, pages 263–272. ACM
Morales R, McIntosh S, Khomh F (2015) Do code review practices impact design quality? a case study of the qt, vtk, and itk projects. In: Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on, pages 171–180. IEEE
Mukadam M, Bird C, Rigby PC (2013) Gerrit software code review data from android. In: 2013 10th Working Conference on Mining Software Repositories (MSR), pp 45–48
Munaiah N, Camilo F, Wigham W, Meneely A, Nagappan M (2017) Do bugs foreshadow vulnerabilities? an in-depth study of the chromium project. Empir Softw Eng 22(3):1305–1347
Nagappan N, Murphy B, Basili VR (2008) The influence of organizational structure on software quality: an empirical case study. In: ICSE, 2008, pp 521–530
Neil M, Fenton N (1996) Predicting software quality using bayesian belief networks. In: Proceedings of the 21st Annual Software Engineering Workshop, pages 217–230 NASA Goddard Space Flight Centre
Neuhaus S, Zimmermann T, Holler C, Zeller A (2007) Predicting vulnerable software components. In: Proceedings of the 14th ACM Conference on Computer and Communications Security, pages 529–540. ACM
Okutan A, Yıldız OT (2014) Software defect prediction using bayesian networks. Empir Softw Eng 19(1):154–181
Pai GJ, Dugan JB (2007) Empirical analysis of software fault content and fault proneness using bayesian methods. IEEE Transactions on software Engineering 33(10):675–686
Pearl J (2014) Probabilistic reasoning in intelligent systems: networks of plausible inference
Pendharkar PC, Subramanian GH, Rodger JA (2005) A probabilistic model for predicting software development effort. IEEE Transactions on software engineering 31(7):615–624
Perez A, Larranaga P, Inza I (2006) Supervised classification with conditional gaussian networks: Increasing the structure complexity from naive bayes. International Journal of Approximate Reasoning 43(1):1–25
Perry D, Porter A, Wade M, Votta L, Perpich J (2002) Reducing inspection interval in large-scale software development. Software Engineering, IEEE Transactions on 28(7):695–705
Pinheiro J, Bates D, DebRoy S, Sarkar D, R Development Core Team (2011) nlme: Linear and nonlinear mixed effects models. R package version 3.1-97. R Foundation for Statistical Computing, Vienna
Porter A, Siy H, Mockus A, Votta L (1998) Understanding the sources of variation in software inspections. ACM Transactions Software Engineering Methodology 7(1):41–79
Rahman MM, Roy CK (2014) An insight into the pull requests of github. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR 2014, pages 364–367, New York, NY, USA. ACM
Rahman MM, Roy CK, Kula RG (2017) Predicting usefulness of code review comments using textual features and developer experience. In: 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), pp 215–226
Rigby P, Cleary B, Painchaud F, Storey M-A, German D (2012) Contemporary peer review in action: Lessons from open source development. IEEE software 29(6):56–61
Rigby PC, Bird C (2013) Convergent contemporary software peer review practices. In: Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pages 202–212 ACM
Rigby PC, German DM, Cowen L, Storey MA (2014) Peer review on Open-Source software projects: parameters, statistical models, and theory. ACM Transactions on Software Engineering and Methodology 23 (4):35:1–35:33
Rigby PC, German DM, Storey M-A (2008) Open Source Software Peer Review Practices: A Case Study of the Apache Server. In: ICSE ’08: Proceedings of the 30th International Conference on Software engineering, pages 541–550, New York, NY, USA, ACM
Rigby PC, Storey MA (2011) Understanding broadcast based peer review on open source software projects. In: Proceedings of the 33rd International Conference on Software Engineering, pages 541–550. ACM
Runeson P, Höst M (2009) Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering 14(2):131–164
RTI (2002) The Economic Impacts of Inadequate Infrastructure for Software Testing. National Institute of Standards and Technology, USA
Sauer C, Jeffery DR, Land L, Yetton P (2000) The effectiveness of software development technical reviews: a behaviorally motivated program of research. IEEE Transactions on Software Engineering 26(1):1–14
Scutari M (2013) Learning bayesian networks in r, an example in systems biology. http://www.bnlearn.com/about/slides/slides-useRconf13.pdf
Shivaji S, Whitehead EJ, Akella R, Kim S (2013) Reducing features to improve code change-based bug prediction. IEEE Trans Softw Eng 39 (4):552–569
Shmueli G (2010) To explain or to predict?. Statistical science, pp 289–310
Shull F, Basili V, Carver J, Maldonado JC, Travassos GH, Mendonça M, Fabbri S (2002) Replicating software engineering experiments: addressing the tacit knowledge problem. In: Proceedings of the 2002 International Symposium on Empirical Software Engineering, pages 7–16. IEEE
Shull FJ, Carver JC, Vegas S, Juristo N (2008) The role of replications in empirical software engineering. Empirical software engineering 13 (2):211–218
Sjoberg DI, Yamashita A, Anda B, Mockus A, Dyba T (2013) Quantifying the effect of code smells on maintenance effort. IEEE Trans Softw Eng 39 (8):1144–1156
Sober E (2002) Instrumentalism, parsimony, and the akaike framework. Philos Sci 69(S3):S112–S123
Stamelos I, Angelis L, Dimou P, Sakellaris E (2003) On the use of bayesian belief networks for the prediction of software productivity. Inf Softw Technol 45(1):51–60
Thongtanunam P, McIntosh S, Hassan AE, Iida H (2016) Revisiting code ownership and its relationship with software quality in the scope of modern code review. In: Proceedings of the 38th International Conference on Software Engineering, ICSE ’16, pages 1039–1050, New York, NY, USA, ACM
Van Koten C, Gray A (2006) An application of bayesian network for predicting object-oriented software maintainability. Inf Softw Technol 48(1):59–67
Votta LG (1993) Does every inspection need a meeting?. SIGSOFT Softw Eng. Notes 18(5):107–114
Wiegers KE (2001) Peer reviews in software: a practical guide. Addison-Wesley Information Technology Series. Addison-Wesley
Yu Y, Wang H, Filkov V, Devanbu P, Vasilescu B (2015) Wait for it: Determinants of pull request evaluation latency on github. In: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pp 367–371
Zheng Q, Mockus A, Zhou M (2015) A method to identify and correct problematic software activity data: Exploiting capacity constraints and data redundancies. In: ESEC/FSE’15, pages 637–648, Bergamo, Italy, ACM
Additional information
Communicated by: Tim Menzies
Cite this article
Krutauz, A., Dey, T., Rigby, P.C. et al. Do code review measures explain the incidence of post-release defects? Empir Software Eng 25, 3323–3356 (2020). https://doi.org/10.1007/s10664-020-09837-4