
Topic Modeling Based Warning Prioritization from Change Sets of Software Repository

  • Regular Paper
Journal of Computer Science and Technology

Abstract

Many existing warning prioritization techniques seek to reorder static analysis warnings so that true positives are presented first. However, investigating and fixing the prioritized warnings still takes an excessive amount of time, because some of them are not actually true positives or are irrelevant to the code context and topic. In this paper, we propose a warning prioritization technique that reflects the various latent topics of bug-related code blocks. Our main aim is to build a prioritization model that maintains separate warning priorities for each topic of the change sets, so as to identify the true positive warnings. To evaluate the proposed model, we employ the warning detection rate, a performance metric widely used in warning prioritization studies, and compare the model with other competitive techniques. Additionally, we verify the effectiveness of the model by applying our technique to eight industrial projects of a real global company.
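
The abstract does not spell out how topic-specific priorities are combined, so the following is only a minimal sketch of one way to picture the idea, not the authors' model. It assumes that each topic carries its own weight per warning category and that a change set comes with a topic mixture (for example, from a topic model such as LDA); a warning category is then scored by the mixture-weighted sum of its per-topic weights. The function name `rank_warnings`, the topic names, the category names, and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_warnings(
    categories: List[str],
    topic_mix: Dict[str, float],
    priorities: Dict[str, Dict[str, float]],
) -> List[Tuple[str, float]]:
    """Rank warning categories for one change set, highest score first.

    topic_mix maps topic -> proportion of that topic in the change set.
    priorities maps topic -> warning category -> weight (hypothetical values).
    A category that a topic does not list contributes 0 for that topic.
    """
    scored = []
    for category in categories:
        score = sum(
            proportion * priorities.get(topic, {}).get(category, 0.0)
            for topic, proportion in topic_mix.items()
        )
        scored.append((category, score))
    # Python's sort is stable, so ties keep their input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative data only: two made-up topics and three made-up categories.
example_priorities = {
    "concurrency": {"NULL_DEREF": 0.2, "UNSYNC_ACCESS": 0.9},
    "io": {"NULL_DEREF": 0.6, "RESOURCE_LEAK": 0.8},
}
example_mix = {"concurrency": 0.75, "io": 0.25}
example_ranking = rank_warnings(
    ["NULL_DEREF", "RESOURCE_LEAK", "UNSYNC_ACCESS"],
    example_mix,
    example_priorities,
)
```

With this made-up data, a change set dominated by the hypothetical "concurrency" topic pushes the synchronization warning to the top, whereas the same warnings under an "io"-dominated mixture would be ordered differently. That difference is the point of keeping separate priorities per topic.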

Author information

Corresponding author

Correspondence to Hoh Peter In.

About this article

Cite this article

Lee, JB., Lee, T. & In, H.P. Topic Modeling Based Warning Prioritization from Change Sets of Software Repository. J. Comput. Sci. Technol. 35, 1461–1479 (2020). https://doi.org/10.1007/s11390-020-0047-8

  • DOI: https://doi.org/10.1007/s11390-020-0047-8
