A comprehensive comparative study of clustering-based unsupervised defect prediction models
Journal of Systems and Software (IF 3.5), Pub Date: 2021-02-01, DOI: 10.1016/j.jss.2020.110862
Zhou Xu, Li Li, Meng Yan, Jin Liu, Xiapu Luo, John Grundy, Yifeng Zhang, Xiaohong Zhang

Abstract Software defect prediction recommends the most defect-prone software modules so that test resources can be allocated more effectively. A limitation of the extensively studied supervised defect prediction methods is that they require labeled software modules, which are not always available. An alternative is to apply clustering-based unsupervised models to unlabeled defect data, an approach called Clustering-based Unsupervised Defect Prediction (CUDP). However, few studies have explored the impact of clustering-based models on defect prediction performance. To fill this gap, we performed a large-scale empirical study of 40 unsupervised models on an open-source dataset comprising 27 project versions with 3 types of features. The experimental results show that (1) different clustering-based models differ significantly in performance, and models in the instance-violation-score-based clustering family clearly outperform those in the hierarchy-based, density-based, grid-based, sequence-based, and hybrid-based clustering families; (2) models in the instance-violation-score-based clustering family achieve performance competitive with typical supervised models; (3) the impact of feature types on model performance depends on the performance indicators used; and (4) clustering-based unsupervised models do not always perform better on defect data that combines all 3 types of features.
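To make the idea of labeling modules without any defect labels concrete, here is a minimal sketch of one simple violation-count heuristic. It is an illustrative assumption, not one of the 40 models evaluated in the paper: the function name `label_modules`, the median thresholds, and the sample metric values are all hypothetical. Each module is scored by how many of its metrics exceed the per-metric median, modules with equal scores form a group, and groups above the median score are flagged as defect-prone.

```python
from statistics import median


def label_modules(modules):
    """Flag modules as defect-prone without using any defect labels.

    modules: non-empty list of equal-length lists of numeric metric values.
    Returns a list of booleans (True = predicted defect-prone).
    Hypothetical heuristic for illustration only.
    """
    n_metrics = len(modules[0])
    # Per-metric median across all modules serves as the threshold.
    medians = [median(row[j] for row in modules) for j in range(n_metrics)]
    # Violation count: number of metrics strictly above their median.
    counts = [
        sum(1 for j in range(n_metrics) if row[j] > medians[j])
        for row in modules
    ]
    # Modules sharing a count form one group; groups whose count is
    # above the median count are labeled defect-prone.
    cutoff = median(counts)
    return [count > cutoff for count in counts]


# Made-up metric rows: [lines of code, cyclomatic complexity, past changes]
sample = [[10, 2, 1], [200, 15, 9], [12, 3, 1], [180, 12, 7]]
predictions = label_modules(sample)
```

With these sample values, the two large, complex modules exceed every median and are flagged, while the two small ones are not. Real CUDP models differ in how they form and label clusters, which is exactly the variation the study compares.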

Updated: 2021-02-01