A comprehensive comparative study of clustering-based unsupervised defect prediction models
Introduction
The defects hidden in software modules threaten the security and decrease the reliability of the software product. Therefore, it is essential to fix the defective modules before delivering the product.
Defect fixing is a complex and time-consuming task, and limited testing resources usually cannot support thorough code reviews (Geremia and Tamburri, 2018). This calls for prioritization when analyzing the software product. In other words, developers and testers should allocate their limited resources to testing the modules that are most likely to contain defects. To support such prioritization, Software Defect Prediction (SDP) has been proposed to identify the most defect-prone modules for priority inspection. The most active SDP methods are supervised models, which first train a classifier on labeled modules and then use it to determine whether or not the unlabeled modules contain defects. However, supervised SDP models need labeled modules from the historical data of the current project or from external projects, and these are not always available.
Unsupervised Defect Prediction (UDP) models make it possible to conduct defect prediction on unlabeled data. As UDP models do not need any labeled data, they have attracted considerable research attention in recent years. There are 2 types of UDP models: Clustering-based Unsupervised Defect Prediction (CUDP) methods (such as the studies Zhong et al., 2004a, Bishnu and Bhattacherjee, 2012, Zhang et al., 2016) and Ranking-based Unsupervised Defect Prediction (RUDP) methods (such as the studies Yang et al., 2016, Fu and Menzies, 2017, Yan et al., 2017, Huang et al., 2017). RUDP methods select one feature and rank the modules by their values of that feature. The rationale behind this type of method is the assumption that the feature values and the defect-proneness of the modules have a direct or inverse proportional relationship (Yang et al., 2016). However, such a relationship does not hold for all features, which has led to inconsistent conclusions in previous studies. For example, Yang et al. (2016) found that RUDP methods performed significantly better than supervised models on change-level just-in-time defect data, but Yan et al. (2017) found that the conclusion in Yang et al. (2016) does not hold on a file-level benchmark dataset. Thus, more work is needed to investigate and verify how well RUDP generalizes to different defect data. In addition, RUDP methods need a threshold (such as the proportion of the top-ranked modules) to divide the modules into two groups for calculating some performance indicators, such as F-measure. However, this threshold is not easy to determine. Unlike RUDP methods, CUDP methods do not rely on the relationship between a specific feature and the defect label to rank the modules, thus avoiding the above contradictory conclusions. CUDP methods divide the modules into different groups based on a specific rule without relying on a threshold.
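To make the ranking-based idea concrete, the following is a minimal sketch of an RUDP-style procedure: rank modules by a single feature and flag the top-ranked proportion. It is not the implementation used in any of the cited studies. The function name `rank_and_label`, the parameter `top_fraction` (the threshold discussed above), and the flag `higher_is_riskier` (which switches between a direct and an inverse relationship) are hypothetical names chosen for illustration.

```python
import math
from typing import List, Sequence


def rank_and_label(
    feature_values: Sequence[float],
    top_fraction: float = 0.2,
    higher_is_riskier: bool = True,
) -> List[bool]:
    """Flag the top-ranked share of modules as defect-prone.

    feature_values: one value per module for the single chosen feature.
    top_fraction: assumed threshold, the proportion of modules to flag.
    higher_is_riskier: True assumes a direct relationship between the
        feature and defect-proneness, False assumes an inverse one.
    """
    if not 0.0 < top_fraction <= 1.0:
        raise ValueError("top_fraction must be in (0, 1]")
    n = len(feature_values)
    if n == 0:
        return []
    # Rank module indices by the chosen feature.
    order = sorted(
        range(n), key=lambda i: feature_values[i], reverse=higher_is_riskier
    )
    # The threshold decides how many top-ranked modules are flagged.
    cutoff = math.ceil(n * top_fraction)
    flagged = set(order[:cutoff])
    return [i in flagged for i in range(n)]


# Four modules described by one (hypothetical) size feature.
sizes = [120, 15, 300, 45]
direct = rank_and_label(sizes, top_fraction=0.5, higher_is_riskier=True)
inverse = rank_and_label(sizes, top_fraction=0.5, higher_is_riskier=False)
```

The two calls flag opposite halves of the same data, which illustrates why the result depends on both the assumed direction of the relationship and the chosen threshold.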
In this work, we focused on CUDP methods and their performance on defect data with different feature sets.
The general process of CUDP methods consists of the following 2 steps: (1) leveraging a similarity metric to cluster unlabeled modules into different groups, such that the modules in the same group are more similar to each other than to those in other groups. This step relies on the information in the data that describes the relationships among the modules; (2) applying a specific strategy to annotate each group as defective or non-defective. In previous studies, researchers have applied several clustering-based methods to unlabeled defect data. For example, in early studies, researchers employed classic clustering methods such as the K-means algorithm (Zhong et al., 2004a) and the self-organizing maps algorithm (Abaei et al., 2013) to group the modules. In more recent studies, researchers designed specific methods to cluster the modules, such as the clustering and label method (Nam and Kim, 2015) and the average clustering method (Yang and Qian, 2016).
There are several limitations in existing CUDP approaches: (1) few studies have conducted a systematic literature review of CUDP articles; (2) all previous studies focus on using existing methods or developing new methods to cluster unlabeled modules for SDP, but few have explored the performance differences among various clustering-based methods for UDP; (3) previous studies have shown that different feature types affect the SDP performance of supervised models (Moser et al., 2008, Zimmermann and Nagappan, 2008, Radjenović et al., 2013, Kaur et al., 2015), but to the best of our knowledge, no study has explored the impact of feature types on the SDP performance of clustering-based methods (i.e., the CUDP performance); and (4) all previous studies evaluated the CUDP performance with traditional indicators that do not consider the effort of inspecting modules, and no study has employed the more practical effort-aware indicators.
Motivated by these limitations, in this work we conducted a large-scale empirical study to analyze the performance differences of 40 clustering-based unsupervised models (as well as 6 supervised models for comparison) on a public benchmark dataset. This dataset consists of 14 projects with a total of 27 versions, in which 3 kinds of features are collected for each project. We evaluated these methods with one traditional and 2 effort-aware indicators. The experimental results show that (1) there exist significant performance differences among these methods, and the hierarchy-based, density-based, grid-based, sequence-based, and hybrid-based clustering models perform significantly worse on the CUDP task in most cases; (2) some clustering-based unsupervised models, such as the instance-violation-score-based clustering methods, can achieve even better performance than the typical supervised models; (3) the CUDP performance of the methods on different indicators is affected by the feature types of the defect data; (4) the supervised models usually perform better on defect data with multiple feature types, whereas this does not hold for the clustering-based unsupervised models.
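The two steps above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of any cited method: step 1 uses a plain K-means with 2 clusters and squared Euclidean distance, initialized deterministically from the modules with the smallest and largest feature sums; step 2 uses a simple heuristic that marks the cluster with the larger average feature sum as defective. The function names `two_means` and `label_clusters` are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def _sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def two_means(modules: Sequence[Vector], max_iter: int = 100) -> List[int]:
    """Step 1: cluster modules into two groups (K-means with k = 2)."""
    if len(modules) < 2:
        raise ValueError("need at least two modules")
    # Assumed deterministic initialization: smallest and largest feature sums.
    centroids = [list(min(modules, key=sum)), list(max(modules, key=sum))]
    assignment: List[int] = [-1] * len(modules)
    for _ in range(max_iter):
        new_assignment = [
            0 if _sq_dist(m, centroids[0]) <= _sq_dist(m, centroids[1]) else 1
            for m in modules
        ]
        if new_assignment == assignment:
            break  # converged: no module changed cluster
        assignment = new_assignment
        for c in (0, 1):
            members = [m for m, a in zip(modules, assignment) if a == c]
            if members:
                centroids[c] = _mean(members)
    return assignment


def label_clusters(modules: Sequence[Vector], assignment: List[int]) -> List[bool]:
    """Step 2: annotate one cluster as defective (assumed heuristic)."""

    def avg_sum(cluster: int) -> float:
        sums = [sum(m) for m, a in zip(modules, assignment) if a == cluster]
        return sum(sums) / len(sums) if sums else float("-inf")

    defective = 0 if avg_sum(0) > avg_sum(1) else 1
    return [a == defective for a in assignment]


# Five modules with two hypothetical metrics each (e.g., size, complexity).
modules = [(10, 1), (12, 2), (11, 1), (200, 15), (220, 18)]
clusters = two_means(modules)
predicted = label_clusters(modules, clusters)
```

In practice the features would usually be normalized before clustering so that a single large-valued metric does not dominate the distance; that step is omitted here for brevity.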
The main contributions of this study include:
- (1) We retrieved and analyzed existing SDP studies involving clustering methods from different perspectives, such as the datasets used, feature types, performance indicators, clustering methods, and labeling schemes. To the best of our knowledge, this is the first work to conduct such a detailed analysis of CUDP studies.
- (2) We applied 40 clustering-based models from 9 clustering families to 27 project versions that have 3 types of features. In addition, we employed both traditional and effort-aware indicators to evaluate the performance of these methods. To the best of our knowledge, we were among the first to conduct such a wide-ranging empirical study to investigate the impact of feature types on the CUDP performance and to use both kinds of indicators for comprehensively evaluating the CUDP performance.
- (3) We designed and implemented an experimental framework that integrates 40 clustering-based unsupervised SDP models from multiple libraries. We further made the framework publicly available and encourage our fellow researchers to integrate their state-of-the-art clustering models into this framework for further comparative studies.
The remainder of the paper is organized as follows: Section 2 introduces the 40 studied clustering-based unsupervised models and summarizes the existing studies related to CUDP. Section 3 describes the design of our empirical study. Section 4 reports our experimental results. Section 5 discusses the implications of the experimental results and the potential validity threats. Section 6 presents different types of empirical studies in the SDP domain. Section 7 concludes this paper and outlines potential future directions.
Section snippets
Taxonomy for clustering-based unsupervised models
As clustering-based unsupervised models identify defective software modules without requiring any labeled modules, it is meaningful to seek models that can achieve similar or better performance than supervised models for defect prediction. We briefly introduce the 40 unsupervised models we studied, which come from 9 clustering families.
Comparative methods
To investigate whether any clustering-based models can outperform the supervised models for SDP, we chose some representative supervised models for comparison. Although one previous study (Ghotra et al., 2015) investigated more than 30 supervised classification models for defect prediction, it is not practical for us to consider all of these models. Because Hall et al. (2011) stated that simple classification models can also perform well on the SDP task, in this work we selected just 6
Results for RQ1
Since we needed to run a total of 46 methods (40 unsupervised models and 6 supervised models) on 27 defect datasets, repeating each run 100 times, we obtained 124,200 (46 × 27 × 100) performance records for this question.
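The record count stated above follows directly from the size of the experimental grid. As a small check, the sketch below recomputes it; the function name `count_records` and its parameter names are hypothetical, and the default values are the numbers given in the text.

```python
def count_records(
    n_unsupervised: int = 40,
    n_supervised: int = 6,
    n_datasets: int = 27,
    n_repetitions: int = 100,
) -> int:
    """Number of performance records: methods x datasets x repetitions."""
    n_methods = n_unsupervised + n_supervised
    return n_methods * n_datasets * n_repetitions


total = count_records()  # 46 * 27 * 100
```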
Fig. 8 depicts the box-plots of 3 indicators on defect data with code complexity features. We reported both the average and median indicator values represented by the colored point and bands inside the boxes, respectively. The boxes with different colors imply distinct meanings
Implications
We provided some implications from the analysis of our experimental results for practitioners and researchers.
- (1)
The methods in the HBC, GBC, and HC families should be avoided in practice for defect prediction. The reason is that no method from these families performs well on all indicators, or on the 2 effort-aware indicators, over defect data with any single feature type or with the combined features. This may explain why the methods in these 3 families were not explored in previous studies. We
Related work
SDP has been an active research field and has been widely studied over the last two decades. Recent studies on this topic can be roughly divided into 3 categories. In the first category, researchers employed ready-made techniques or proposed new methods for the SDP task, among which machine learning methods are the mainstream. This type of study aims to improve the performance of detecting the defective modules, such as the work in Jing et al., 2014, Xia et al., 2016 and Jing et
Conclusions
We conducted a large-scale comparison to analyze SDP performance differences among 40 clustering-based unsupervised models and 6 typical supervised models. We took a first step towards investigating the impact of the feature types of defect data on the performance of these methods. Our experimental results on 81 defect datasets indicate that not all clustering-based unsupervised models are worse than the supervised models, and the performance of the methods in the IVSBC family is particularly
CRediT authorship contribution statement
Zhou Xu: Writing - original draft, Methodology, Data curation. Li Li: Conceptualization, Writing - review & editing. Meng Yan: Supervision, Formal analysis. Jin Liu: Supervision. Xiapu Luo: Writing - review & editing. John Grundy: Writing - review & editing. Yifeng Zhang: Methodology, Software, Visualization. Xiaohong Zhang: Project administration.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
The authors would like to thank Jaechang Nam for providing the source code of CLA and CLAMI. This work is supported by the National Key Research and Development Project (No. 2018YFB2101200), the National Natural Science Foundation of China (Nos. 62002034, 61972290), the Fundamental Research Funds for the Central Universities (Nos. 2020CDJQY-A021, 2020CDCGRJ072), China Postdoctoral Science Foundation (No. 2020M673137), the Natural Science Foundation of Chongqing in China (No.
References (130)
- A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. J. Syst. Softw. (2010)
- FCM: The fuzzy c-means clustering algorithm. Comput. Geosci. (1984)
- A systematic review of software fault prediction studies. Expert Syst. Appl. (2009)
- Negative samples reduction in cross-company software defects prediction. Inf. Softw. Technol. (2015)
- Software defect number prediction: Unsupervised vs supervised methods. Inf. Softw. Technol. (2019)
- ROCK: A robust clustering algorithm for categorical attributes. Inf. Syst. (2000)
- The self-organizing map. Neurocomputing (1998)
- A systematic review of unsupervised learning techniques for software defect prediction. Inf. Softw. Technol. (2020)
- Using self-organizing maps to analyze object-oriented software measures. J. Syst. Softw. (2001)
- Software fault prediction metrics: A systematic literature review. Inf. Softw. Technol. (2013)
- Fault prediction by utilizing self-organizing map and threshold
- Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, Vol. 27
- Hierarchical k-means clustering
- Hierarchical clustering on principal components
- OPTICS: ordering points to identify the clustering structure
- Parameter tuning or default values? An empirical investigation in search-based software engineering. Empir. Softw. Eng.
- K-Means Vs Mini Batch K-Means: A Comparison. Tech. Rep.
- Impact of the distribution parameter of data sampling approaches on software defect prediction models
- Investigating the effects of balanced training and testing datasets on effort-aware fault prediction models
- The significant effects of data sampling approaches on software defect prioritization and classification
- Software fault prediction using quad tree-based k-means clustering algorithm. IEEE Trans. Knowl. Data Eng. (TKDE)
- Clustering and metrics thresholds based software fault prediction of unlabeled program modules
- Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI)
- Applying swarm ensemble clustering technique for fault prediction using software metrics
- Evaluating defect prediction approaches: a benchmark and an extensive comparison. Empir. Softw. Eng.
- Fuzzy shell-clustering and applications to circle detection in digital images. Int. J. Gen. Syst.
- Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc.
- A developer centered bug prediction model. IEEE Trans. Softw. Eng. (TSE)
- Cluster merging and splitting in hierarchical clustering algorithms
- Knowledge acquisition via incremental conceptual clustering. Mach. Learn.
- Clustering by passing messages between data points. Science
- Some Competitive Learning Methods
- Estimating the number of faults in code. IEEE Trans. Softw. Eng. (TSE)
- Choosing software metrics for defect prediction: an investigation on feature selection techniques. Softw. - Pract. Exp.
- Varying defect prediction approaches during project evolution: A preliminary investigation
- Revisiting the impact of classification techniques on the performance of defect prediction models
- A large-scale study of the impact of feature selection techniques on defect classification models
- CURE: an efficient clustering algorithm for large databases
- Software quality prediction using mixture models with EM algorithm
- Analysis of clustering techniques for software quality prediction
- Software quality analysis of unlabeled program moduls with fuzzy-c means clustering techniques. IMRS’s Int. J. Eng. Sci.
- Estimating of software quality with clustering techniques
- A systematic literature review on fault prediction performance in software engineering. IEEE Trans. Softw. Eng. (TSE)
- Data Mining: Concepts and Techniques
- Algorithm AS 136: A k-means clustering algorithm. J. R. Stat. Soc.
- Predicting faults using the complexity of code changes
Zhou Xu received the B.S. degree from Huazhong Agricultural University, China, in 2014 and the Ph.D. degree from Wuhan University, China, in 2019. He is now an Assistant Professor at the School of Big Data and Software Engineering, Chongqing University, China. His research interests include software defect prediction, feature engineering, and machine learning.
Li Li is a Lecturer in the Faculty of Information Technology, Monash University. Prior to joining Monash, he was a research associate in Software Engineering at the University of Luxembourg and an honorary research associate at University College London. He obtained his Ph.D. degree in November 2016 from the University of Luxembourg. His research interests are in the areas of Android Security, Static Analysis, Machine Learning, Deep Learning, and Empirical Study.
Meng Yan is now an Assistant Professor at the School of Big Data and Software Engineering, Chongqing University, China. Prior to joining Chongqing University, he was a Postdoc at Zhejiang University advised by Prof. Shanping Li and Dr. Xin Xia. He received his Ph.D. degree in June 2017 under the supervision of Prof. Xiaohong Zhang from Chongqing University, China. His current research focuses on how to improve developers’ productivity, how to improve software quality, and how to reduce the effort during software development by analyzing rich software repository data.
Jin Liu received the Ph.D. degree in computer science from the State Key Lab of Software Engineering, Wuhan University, China, in 2005. He is currently a Professor in the School of Computer Science, Wuhan University. His research interests include software engineering, machine learning, and interactive collaboration on the Web.
Xiapu Luo received the Ph.D. degree in computer science from The Hong Kong Polytechnic University in 2007, and was a Post-Doctoral Research Fellow with the Georgia Institute of Technology. Now, he is an Associate Professor with the Department of Computing and an Associate Researcher with the Shenzhen Research Institute, The Hong Kong Polytechnic University. His current research focuses on smart phone security and privacy, network security and privacy, and Internet measurement.
John Grundy is the Senior Deputy Dean for the Faculty of Information Technology and a Professor of Software Engineering at Monash University. He holds the B.Sc. (Hons), M.Sc., and Ph.D. degrees, all in Computer Science, from the University of Auckland. He is a Fellow of Automated Software Engineering, Fellow of Engineers Australia, Certified Professional Engineer, Engineering Executive, Member of the ACM and Senior Member of the IEEE.
Yifeng Zhang received his master's degree from Wuhan University, China, in 2019. His research interests focus on software engineering and machine learning.
Xiaohong Zhang received the M.S. degree in applied mathematics and the Ph.D. degree in computer software and theory from Chongqing University, China, in 2006. He is currently a Professor and the Vice Dean of the School of Big Data and Software Engineering, Chongqing University. His current research interests include data mining of software engineering, topic modeling, image semantic analysis, and video analysis.