A comprehensive comparative study of clustering-based unsupervised defect prediction models

https://doi.org/10.1016/j.jss.2020.110862

Highlights

  • Conducting a detailed analysis for SDP studies involving clustering methods.

  • Applying 40 clustering models to 27 project versions with 3 types of features.

  • Using both traditional and effort-aware indicators for performance evaluation.

  • Implementing a framework integrating 40 unsupervised defect prediction models.

Abstract

Software defect prediction recommends the most defect-prone software modules for optimization of the test resource allocation. The limitation of the extensively-studied supervised defect prediction methods is that they require labeled software modules, which are not always available. An alternative solution is to apply clustering-based unsupervised models to the unlabeled defect data, called Clustering-based Unsupervised Defect Prediction (CUDP). However, few studies have explored the impacts of clustering-based models on defect prediction performance. In this work, we performed a large-scale empirical study on 40 unsupervised models to fill this gap. We chose an open-source dataset including 27 project versions with 3 types of features. The experimental results show that (1) different clustering-based models have significant performance differences, and the performance of models in the instance-violation-score-based clustering family is obviously superior to that of models in the hierarchy-based, density-based, grid-based, sequence-based, and hybrid-based clustering families; (2) the models in the instance-violation-score-based clustering family achieve competitive performance compared with typical supervised models; (3) the impacts of feature types on the performance of the models are related to the indicators used; and (4) the clustering-based unsupervised models do not always achieve better performance on defect data with the combination of the 3 types of features.

Introduction

The defects hidden in software modules threaten the security and decrease the reliability of the software product. Therefore, it is essential to fix the defective modules before delivering the product.

Defect fixing is a complex and time-consuming task, and limited testing resources are usually insufficient to support thorough code reviews (Geremia and Tamburri, 2018). This calls for a prioritization to better analyze the software product. In other words, developers and testers should reasonably allocate the limited resources to test the modules that have a high probability of containing defects. To establish such a prioritization, Software Defect Prediction (SDP) is proposed to identify the most defect-prone modules for priority inspection. The most active SDP methods are supervised models, which first train a classifier on labeled modules and then use it to determine whether or not the unlabeled modules contain defects. However, supervised SDP models need labeled modules from historical data of the current project or external projects, which are not always available.

In order to conduct defect prediction on unlabeled data, Unsupervised Defect Prediction (UDP) models can be applied. As UDP models do not need any labeled data, they have attracted many researchers’ attention in recent years. There are 2 types of UDP models: Clustering-based Unsupervised Defect Prediction (CUDP) methods (such as the studies Zhong et al., 2004a, Bishnu and Bhattacherjee, 2012, Zhang et al., 2016) and Ranking-based Unsupervised Defect Prediction (RUDP) methods (such as the studies Yang et al., 2016, Fu and Menzies, 2017, Yan et al., 2017, Huang et al., 2017). RUDP methods select one feature and rank modules based on the corresponding values. The rationale behind this type of method is the assumption that the feature values and the defect-proneness of the modules have a direct or inverse proportional relationship (Yang et al., 2016). However, such a relationship does not exist for all features, which leads to inconsistent conclusions in previous studies. For example, Yang et al. (2016) found that RUDP methods performed significantly better than supervised models on change-level just-in-time defect data, but Yan et al. (2017) found that this conclusion does not hold on a file-level benchmark dataset. Thus, more work is needed to investigate and verify the generalization of RUDP on different defect data. In addition, RUDP methods need a threshold (such as the proportion of top-ranked modules) to divide the modules into two groups for calculating some performance indicators, such as F-measure. However, this threshold is not easy to determine. Unlike RUDP methods, CUDP methods do not rely on the relationship between a specific feature and the defect label to rank the modules, thus avoiding the above contradictory conclusions. CUDP methods divide the modules into different groups based on a specific rule without relying on a threshold.
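The RUDP scheme described above can be sketched as follows. This is a minimal, hypothetical illustration (the function name, the choice of LOC as the ranking feature, and the 20% threshold are illustrative assumptions, not the exact setup of any cited study): modules are ranked by a single feature value, and a threshold then splits the ranking into "defect-prone" and "clean" groups.

```python
# Hypothetical RUDP sketch: rank modules by one feature, then apply a
# threshold to split the ranking into defect-prone vs. clean modules.

def rudp_rank(loc_values, top_fraction=0.2):
    """Rank module indices by ascending LOC (smaller modules first),
    then flag the top fraction of the ranking as defect-prone."""
    order = sorted(range(len(loc_values)), key=lambda i: loc_values[i])
    cutoff = max(1, int(len(order) * top_fraction))
    flagged = set(order[:cutoff])
    return order, [i in flagged for i in range(len(loc_values))]

# Toy LOC values for 5 modules.
loc = [120, 15, 300, 42, 8]
order, flags = rudp_rank(loc)
print(order)   # module indices sorted by ascending LOC
print(flags)   # per-module defect-prone flags
```

The need to pick `top_fraction` by hand is exactly the threshold problem noted above, which CUDP methods avoid.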
In this work, we focused on CUDP methods and their performance on defect data with different feature sets.

The general process of CUDP methods consists of the following 2 steps: (1) leveraging a similarity metric to cluster unlabeled modules into different groups where the modules in the same group are more similar to each other compared with those in other groups. This step is based on the information found in the data that describes the relationships among the modules; (2) applying a specific strategy to annotate each group as defective or non-defective. In previous studies, researchers have applied some clustering-based methods to unlabeled defect data. For example, in early studies, researchers employed classic clustering methods like K-means algorithm (Zhong et al., 2004a) and self-organizing maps algorithm (Abaei et al., 2013) to group the modules. In more recent studies, researchers designed specific methods to cluster the modules, such as clustering and label method (Nam and Kim, 2015), and average clustering method (Yang and Qian, 2016).
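The two steps above can be sketched in code. The following is a minimal illustration in the spirit of the cited clustering-and-labeling (CLA) method of Nam and Kim (2015), not the exact implementation of any studied model: step 1 groups modules by a "violation score" (how many of their metric values exceed that metric's median), and step 2 labels the higher-score groups as defective.

```python
# Minimal CUDP sketch (CLA-style; function name and details are illustrative).

def cla_sketch(rows):
    """rows: list of equal-length metric vectors, one per module.
    Returns a per-module defective/non-defective label."""
    n_feat = len(rows[0])
    # Median of each metric column.
    medians = []
    for j in range(n_feat):
        col = sorted(r[j] for r in rows)
        m = len(col)
        medians.append((col[m // 2] + col[(m - 1) // 2]) / 2)
    # Step 1: violation score = number of metrics above their median.
    scores = [sum(1 for j in range(n_feat) if r[j] > medians[j]) for r in rows]
    # Step 2: groups whose score exceeds half the feature count are
    # labeled defective (high metric values taken as a defect signal).
    half = n_feat / 2
    return [s > half for s in scores]

# Toy data: 4 modules, 3 metrics each.
rows = [[10, 2, 5], [200, 9, 40], [12, 1, 6], [180, 8, 35]]
print(cla_sketch(rows))
```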

There are several limitations in existing CUDP approaches: (1) there are few studies conducting a systematic literature review of CUDP articles; (2) all previous studies focus on using existing methods or developing new methods to cluster unlabeled modules for SDP, but few studies have explored the performance differences of various clustering-based methods for UDP; (3) previous studies have shown that different feature types have impacts on the SDP performance of supervised models (Moser et al., 2008, Zimmermann and Nagappan, 2008, Radjenović et al., 2013, Kaur et al., 2015), but to the best of our knowledge, no study has explored the impacts of feature types on the SDP performance of the clustering-based methods (i.e., the CUDP performance); and (4) all previous studies evaluated the CUDP performance with traditional indicators that do not consider the effort of inspecting modules, and no study has employed the more practical effort-aware indicators.

Motivated by these limitations, in this work we conducted a large-scale empirical study to analyze the performance differences of 40 clustering-based unsupervised models (as well as 6 supervised models for comparison) on a public benchmark dataset. This dataset consists of 14 projects with a total of 27 versions, for each of which 3 kinds of features are collected. We evaluated these methods with one traditional and 2 effort-aware indicators. The experimental results show that (1) there exist significant performance differences among these methods, and the hierarchy-based, density-based, grid-based, sequence-based, and hybrid-based clustering models perform significantly worse on the CUDP task in most cases; (2) some clustering-based unsupervised models, such as the instance-violation-score-based clustering methods, can achieve even better performance than the typical supervised models; (3) the CUDP performance of the methods on different indicators is affected by the feature types of the defect data; (4) the supervised models usually perform better on defect data with multiple feature types, whereas this does not hold for the clustering-based unsupervised models.

The main contributions of this study include:

  • (1)

    We retrieved and analyzed existing SDP studies involving clustering methods from different perspectives, such as the used datasets, feature types, performance indicators, clustering methods, and labeling schemes. To the best of our knowledge, this is the first work to conduct such a detailed analysis for CUDP studies.

  • (2)

We applied 40 clustering-based models from 9 clustering families to 27 project versions that have 3 types of features. In addition, we employed both traditional and effort-aware indicators to evaluate the performance of these methods. To the best of our knowledge, we were among the first to conduct such a wide-ranging empirical study investigating the impacts of feature types on the CUDP performance and using both kinds of indicators to synthetically evaluate the CUDP performance.

  • (3)

We designed and implemented an experimental framework that integrates 40 clustering-based unsupervised SDP models from multiple libraries. We have made the framework publicly available and encourage our fellow researchers to integrate their state-of-the-art clustering models into this framework for further comparative studies.
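As an illustration of the effort-aware evaluation mentioned in the contributions, the sketch below computes a generic effort-aware indicator: the proportion of defective modules found when inspecting modules in the predicted order until 20% of the total LOC is consumed. The function name and the 20% budget are illustrative assumptions, not necessarily the exact indicators used in this study.

```python
# Generic effort-aware indicator sketch (names and budget are illustrative).

def pofb_at_20(loc, is_buggy, ranking):
    """Fraction of defective modules found after inspecting modules in
    `ranking` order until 20% of total LOC has been consumed."""
    budget = 0.2 * sum(loc)
    spent, found = 0.0, 0
    for i in ranking:
        spent += loc[i]
        if spent > budget:
            break
        found += is_buggy[i]
    total_bugs = sum(is_buggy)
    return found / total_bugs if total_bugs else 0.0

# Toy example: 4 modules, with a hypothetical predicted inspection order.
loc = [10, 50, 30, 10]
is_buggy = [1, 0, 1, 0]
ranking = [0, 2, 1, 3]
print(pofb_at_20(loc, is_buggy, ranking))
```

Unlike a traditional indicator such as F-measure, this kind of measure rewards models that place defective modules early in the ranking at low inspection cost.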

The remainder of the paper is organized as follows: Section 2 introduces the studied 40 clustering-based unsupervised models and summarizes the existing studies related to CUDP. Section 3 describes the design of our empirical study. Section 4 reports our experimental results. Section 5 discusses the implications of the experimental results and the potential validity threats. Section 6 presents different types of empirical studies in the SDP domain. Section 7 concludes this paper and draws potential future directions.

Section snippets

Taxonomy for clustering-based unsupervised models

As clustering-based unsupervised models identify defective software modules without requiring the participation of labeled modules, it is meaningful to seek models that can achieve similar or better performance than supervised models for defect prediction. We briefly introduce the 40 studied unsupervised models from 9 clustering families.

Comparative methods

To investigate whether there exist any clustering-based models that can outperform the supervised models for SDP, we chose some representative supervised models for comparison. Although a previous study (Ghotra et al., 2015) investigated more than 30 supervised classification models for defect prediction, it is not practical for us to consider all these models. As Hall et al. (2011) stated, simple classification models can also perform well on the SDP task, so in this work we just selected 6

Results for RQ1

Since we needed to run each of the 46 methods (40 unsupervised models and 6 supervised models) 100 times on each of the 27 defect datasets, we obtained 124,200 (46 × 27 × 100) records of performance results for this question.

Fig. 8 depicts the box-plots of 3 indicators on defect data with code complexity features. We reported both the average and median indicator values represented by the colored point and bands inside the boxes, respectively. The boxes with different colors imply distinct meanings

Implications

We provided some implications from the analysis of our experimental results for practitioners and researchers.

  • (1)

The methods in the HBC, GBC, and HC families should be avoided in practice for defect prediction. The reason is that no method from these families performs well on all indicators, or on the 2 effort-aware indicators, over defect data with any single feature type or the combined features. This may explain why the methods in these 3 families were not explored in previous studies. We

Related work

The topic of SDP has been an active research field and has been widely studied in the last two decades. Recent studies on this topic can be roughly divided into 3 categories. In the first category, researchers employed ready-made techniques or proposed new methods for the SDP task, among which machine learning methods are the mainstream trend. This type of study aims to improve the performance of detecting the defective modules, such as the work in Jing et al., 2014, Xia et al., 2016 and Jing et

Conclusions

We conducted a large-scale comparison to analyze SDP performance differences among 40 clustering-based unsupervised models and 6 typical supervised models. We made the first step towards investigating the impacts of the feature types of defect data on the performance of these methods. Our experimental results on 81 defect data indicate that not all clustering-based unsupervised models are worse than the supervised models, and the performance of the methods in the IVSBC family is particularly

CRediT authorship contribution statement

Zhou Xu: Writing - original draft, Methodology, Data curation. Li Li: Conceptualization, Writing - review & editing. Meng Yan: Supervision, Formal analysis. Jin Liu: Supervision. Xiapu Luo: Writing - review & editing. John Grundy: Writing - review & editing. Yifeng Zhang: Methodology, Software, Visualization. Xiaohong Zhang: Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

The authors would like to thank Jaechang Nam for providing the source code of CLA and CLAMI. This work is supported by the National Key Research and Development Project (No. 2018YFB2101200), the National Natural Science Foundation of China (Nos. 62002034, 61972290), the Fundamental Research Funds for the Central Universities (Nos. 2020CDJQY-A021, 2020CDCGRJ072), China Postdoctoral Science Foundation (No. 2020M673137), the Natural Science Foundation of Chongqing in China (No.

Zhou Xu received the B.S. degree from Huazhong Agricultural University, China, in 2014 and the Ph.D. degree from Wuhan University, China, in 2019. He is now an Assistant Professor at the School of Big Data and Software Engineering, Chongqing University, China. His research interests include software defect prediction, feature engineering, and machine learning.

References (130)

  • Abaei, G., et al. Fault prediction by utilizing self-organizing map and threshold.
  • Agrawal, R., et al. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, Vol. 27 (1998).
  • Alboukadel, K. Hierarchical k-means clustering (2017).
  • Alboukadel, K. Hierarchical clustering on principal components (2017).
  • Ankerst, M., et al. OPTICS: ordering points to identify the clustering structure.
  • Arcuri, A., et al. Parameter tuning or default values? An empirical investigation in search-based software engineering. Empir. Softw. Eng. (2013).
  • Béjar Alonso, J. K-Means vs Mini Batch K-Means: A Comparison. Tech. Rep. (2013).
  • Bennin, K.E., et al. Impact of the distribution parameter of data sampling approaches on software defect prediction models.
  • Bennin, K.E., et al. Investigating the effects of balanced training and testing datasets on effort-aware fault prediction models.
  • Bennin, K.E., et al. The significant effects of data sampling approaches on software defect prioritization and classification.
  • Bishnu, P.S., et al. Software fault prediction using quad tree-based k-means clustering algorithm. IEEE Trans. Knowl. Data Eng. (TKDE) (2012).
  • Catal, C., et al. Clustering and metrics thresholds based software fault prediction of unlabeled program modules.
  • Catal, C., et al.
  • Cheng, Y. Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) (1995).
  • Coelho, R.A., et al. Applying swarm ensemble clustering technique for fault prediction using software metrics.
  • D’Ambros, M., et al. Evaluating defect prediction approaches: a benchmark and an extensive comparison. Empir. Softw. Eng. (2012).
  • Dave, R.N. Fuzzy shell-clustering and applications to circle detection in digital images. Int. J. Gen. Syst. (1990).
  • Dempster, A.P., et al. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. (1977).
  • Di Nucci, D., et al. A developer centered bug prediction model. IEEE Trans. Softw. Eng. (TSE) (2017).
  • Di Nucci, D., et al. A developer centered bug prediction model. Trans. Softw. Eng. (2018).
  • Ding, C., et al. Cluster merging and splitting in hierarchical clustering algorithms.
  • Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al., 1996. A density-based algorithm for discovering clusters in...
  • Fisher, D.H. Knowledge acquisition via incremental conceptual clustering. Mach. Learn. (1987).
  • Frey, B.J., et al. Clustering by passing messages between data points. Science (2007).
  • Fritzke, B. Some Competitive Learning Methods (1997).
  • Fu, W., Menzies, T., 2017. Revisiting unsupervised learning for defect prediction. In: Proceedings of the 2017 11th...
  • Gaffney, J.E. Estimating the number of faults in code. IEEE Trans. Softw. Eng. (TSE) (1984).
  • Gao, K., et al. Choosing software metrics for defect prediction: an investigation on feature selection techniques. Softw. - Pract. Exp. (2011).
  • Geremia, S., et al. Varying defect prediction approaches during project evolution: A preliminary investigation.
  • Ghotra, B., et al. Revisiting the impact of classification techniques on the performance of defect prediction models.
  • Ghotra, B., et al. A large-scale study of the impact of feature selection techniques on defect classification models.
  • Guha, S., et al. CURE: an efficient clustering algorithm for large databases.
  • Guo, P., et al. Software quality prediction using mixture models with EM algorithm.
  • Gupta, D., et al. Analysis of clustering techniques for software quality prediction.
  • Gupta, D., et al. Software quality analysis of unlabeled program modules with fuzzy-c means clustering techniques. IMRS’s Int. J. Eng. Sci. (2012).
  • Gupta, D., et al. Estimating of software quality with clustering techniques.
  • Hall, T., et al. A systematic literature review on fault prediction performance in software engineering. IEEE Trans. Softw. Eng. (TSE) (2011).
  • Han, J., et al. Data Mining: Concepts and Techniques (2011).
  • Hartigan, J.A., et al. Algorithm AS 136: A k-means clustering algorithm. J. R. Stat. Soc. (1979).
  • Hassan, A.E. Predicting faults using the complexity of code changes.

    Li Li is a Lecturer in the Faculty of Information Technology, Monash University. Prior to joining Monash, he was a research associate in Software Engineering at the University of Luxembourg and an honorary research associate at University College London. He obtained his Ph.D. degree in November 2016 from the University of Luxembourg. His research interests are in the areas of Android Security, Static Analysis, Machine Learning, Deep Learning, and Empirical Study.

    Meng Yan is now an Assistant Professor at the School of Big Data and Software Engineering, Chongqing University, China. Prior to joining Chongqing University, he was a Postdoc at Zhejiang University advised by Prof. Shanping Li and Dr. Xin Xia. He got his Ph.D. degree in June 2017 under the supervision of Prof. Xiaohong Zhang from Chongqing University, China. His current research focuses on how to improve developers’ productivity, how to improve software quality, and how to reduce the effort during software development by analyzing rich software repository data.

    Jin Liu received the Ph.D. degree in computer science from the State Key Lab of Software Engineering, Wuhan University, China, in 2005. He is currently a Professor in the School of Computer Science, Wuhan University. His research interests include software engineering, machine learning, and interactive collaboration on the Web.

    Xiapu Luo received the Ph.D. degree in computer science from The Hong Kong Polytechnic University in 2007, and was a Post-Doctoral Research Fellow with the Georgia Institute of Technology. Now, he is an Associate Professor with the Department of Computing and an Associate Researcher with the Shenzhen Research Institute, The Hong Kong Polytechnic University. His current research focuses on smart phone security and privacy, network security and privacy, and Internet measurement.

    John Grundy is the Senior Deputy Dean for the Faculty of Information Technology and a Professor of Software Engineering at Monash University. He holds the B.Sc. (Hons), M.Sc., and Ph.D. degrees, all in Computer Science, from the University of Auckland. He is a Fellow of Automated Software Engineering, Fellow of Engineers Australia, Certified Professional Engineer, Engineering Executive, Member of the ACM, and Senior Member of the IEEE.

    Yifeng Zhang received his master’s degree from Wuhan University, China, in 2019. His research interests focus on software engineering and machine learning.

    Xiaohong Zhang received the M.S. degree in applied mathematics and the Ph.D. degree in computer software and theory from Chongqing University, China, in 2006. He is currently a Professor and the Vice Dean of the School of Big Data and Software Engineering, Chongqing University. His current research interests include data mining of software engineering, topic modeling, image semantic analysis, and video analysis.
