Topic Modeling Based Warning Prioritization from Change Sets of Software Repository

Lee, Jung-Been; Lee, Taek; In, Hoh Peter

doi:10.1007/s11390-020-0047-8

Regular Paper
Published: 30 November 2020

Volume 35, pages 1461–1479, (2020)

Journal of Computer Science and Technology

Jung-Been Lee¹,
Taek Lee² &
Hoh Peter In¹

Abstract

Many existing warning prioritization techniques seek to reorder static analysis warnings so that true positives are presented first. However, investigating and fixing the prioritized warnings still takes an excessive amount of time because some are not actually true positives or are irrelevant to the code context and topic. In this paper, we propose a warning prioritization technique that reflects various latent topics from bug-related code blocks. Our main aim is to build a prioritization model that comprises separate warning priorities depending on the topic of the change sets, in order to identify the number of true positive warnings. To evaluate the proposed model, we employ the warning detection rate, a performance metric widely used in warning prioritization studies, and compare the proposed model with other competitive techniques. Additionally, we verify the effectiveness of our model by applying our technique to eight industrial projects of a real global company.
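
The idea of keeping separate warning priorities per change-set topic can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each change set has already been assigned a single topic (for example by a topic model such as LDA), it scores a warning category by its historical true-positive ratio within that topic, and it reads the warning detection rate as the share of all true positives found in the top-k ranked warnings. The function names, topic labels, warning categories, and the detection-rate definition used here are all hypothetical.

```python
from collections import defaultdict


def build_topic_priorities(history):
    """Estimate a priority for each (topic, warning category) pair.

    history: iterable of (topic, category, is_true_positive) tuples.
    The priority is the historical true-positive ratio (an assumption,
    not the weighting scheme of the paper).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [true positives, total]
    for topic, category, is_tp in history:
        entry = counts[(topic, category)]
        entry[0] += int(is_tp)
        entry[1] += 1
    return {key: tp / total for key, (tp, total) in counts.items()}


def prioritize(warnings, priorities):
    """Order warnings by the priority of their (topic, category) pair.

    warnings: list of (warning_id, topic, category) tuples.
    Pairs never seen in the history get priority 0.0.
    """
    return sorted(
        warnings,
        key=lambda w: priorities.get((w[1], w[2]), 0.0),
        reverse=True,
    )


def detection_rate(ranked, true_positive_ids, k):
    """Share of all true positives found among the top-k ranked warnings.

    This definition is an assumed reading of "warning detection rate".
    """
    if not true_positive_ids:
        return 0.0
    found = sum(1 for w in ranked[:k] if w[0] in true_positive_ids)
    return found / len(true_positive_ids)


# Hypothetical history: the same category behaves differently per topic.
history = [
    ("network", "NULL_DEREF", True),
    ("network", "NULL_DEREF", True),
    ("network", "UNUSED_VAR", False),
    ("ui", "NULL_DEREF", False),
    ("ui", "UNUSED_VAR", True),
]
priorities = build_topic_priorities(history)

new_warnings = [
    ("w1", "ui", "NULL_DEREF"),
    ("w2", "network", "NULL_DEREF"),
    ("w3", "ui", "UNUSED_VAR"),
    ("w4", "network", "UNUSED_VAR"),
]
ranked = prioritize(new_warnings, priorities)
rate = detection_rate(ranked, {"w2", "w3"}, k=2)
```

In this toy data a single global ranking would treat both NULL_DEREF warnings alike, whereas the per-topic priorities place the one from the "network" change set ahead of the one from the "ui" change set.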

Author information

Authors and Affiliations

College of Informatics, Korea University, Seoul, 02841, Korea
Jung-Been Lee & Hoh Peter In
College of Knowledge-Based Services Engineering, Sungshin University, Seoul, 02844, Korea
Taek Lee

Corresponding author

Correspondence to Hoh Peter In.

Supplementary Information

ESM 1 (PDF 239 kb)

About this article

Cite this article

Lee, JB., Lee, T. & In, H.P. Topic Modeling Based Warning Prioritization from Change Sets of Software Repository. J. Comput. Sci. Technol. 35, 1461–1479 (2020). https://doi.org/10.1007/s11390-020-0047-8

Received: 19 September 2019
Revised: 01 May 2020
Published: 30 November 2020
Issue Date: November 2020
DOI: https://doi.org/10.1007/s11390-020-0047-8
