COSTE: Complexity-based OverSampling TEchnique to alleviate the class imbalance problem in software defect prediction
Information and Software Technology ( IF 3.9 ) Pub Date : 2020-09-25 , DOI: 10.1016/j.infsof.2020.106432
Shuo Feng , Jacky Keung , Xiao Yu , Yan Xiao , Kwabena Ebo Bennin , Md Alamgir Kabir , Miao Zhang

Context:

Generally, there are more non-defective instances than defective instances in the datasets used for software defect prediction (SDP), which is referred to as the class imbalance problem. Oversampling techniques are frequently adopted to alleviate the problem by generating new synthetic defective instances. Existing techniques generate either near-duplicated instances which result in overgeneralization (high probability of false alarm, pf) or overly diverse instances which hurt the prediction model’s ability to find defects (resulting in low probability of detection, pd). Furthermore, when existing oversampling techniques are applied in SDP, the effort needed to inspect the instances with different complexity is not taken into consideration.
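The two measures named above can be made concrete with a small sketch. It assumes the usual confusion-matrix definitions used in defect prediction, pd = TP / (TP + FN) and pf = FP / (FP + TN), with label 1 meaning defective. The function name `pd_pf` and the sample labels are illustrative only and do not come from the paper.

```python
def pd_pf(actual, predicted):
    """Return (pd, pf) for binary labels where 1 = defective, 0 = non-defective."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # defects correctly flagged
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # defects missed
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarms
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # clean modules left alone
    # Guard against empty classes so the sketch never divides by zero.
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    return pd, pf


if __name__ == "__main__":
    actual = [1, 1, 0, 0, 0, 0]
    predicted = [1, 0, 1, 0, 0, 0]
    print(pd_pf(actual, predicted))
```

A good oversampling technique should push pd up without pushing pf up with it, which is the trade-off the abstract describes.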

Objective:

In this study, we introduce Complexity-based OverSampling TEchnique (COSTE), a novel oversampling technique that can achieve low pf and high pd simultaneously. COSTE also performs better in terms of Norm(popt) and ACC, two effort-aware measures that account for testing effort.

Method:

COSTE combines pairs of defective instances with similar complexity to generate synthetic instances. This improves the diversity within the data, maintains the ability of prediction models to find defects, and takes into account the different testing effort that different instances require. We conduct experiments to compare COSTE with Synthetic Minority Oversampling TEchnique (SMOTE), Borderline-SMOTE, Majority Weighted Minority Oversampling TEchnique (MWMOTE) and MAHAKIL.
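The pairing idea can be illustrated with a rough sketch. The abstract does not say how complexity is measured or how a pair is combined, so this sketch makes two assumptions of its own: the sum of an instance's metric values stands in for complexity, and the midpoint of a pair stands in for the combination. The function name `coste_like_oversample` is hypothetical, and this is not the authors' implementation.

```python
def coste_like_oversample(defective, n_new, complexity=sum):
    """Build up to n_new synthetic rows from defective rows of similar complexity.

    defective  -- list of equal-length numeric feature rows (defective class only)
    complexity -- ASSUMED proxy: maps a row to one number (default: sum of metrics)
    """
    # Rank defective instances so that neighbours have similar complexity.
    ranked = sorted(defective, key=complexity)
    synthetic = []
    # Pair closest neighbours first (gap 1), then widen the gap if more rows are needed.
    for gap in range(1, len(ranked)):
        for i in range(len(ranked) - gap):
            if len(synthetic) == n_new:
                return synthetic
            a, b = ranked[i], ranked[i + gap]
            # ASSUMED combination rule: feature-wise midpoint of the pair.
            synthetic.append([(x + y) / 2 for x, y in zip(a, b)])
    # Fewer than n_new rows are returned once every distinct pair has been used.
    return synthetic


if __name__ == "__main__":
    rows = [[10, 2], [100, 9], [12, 3], [90, 8]]
    print(coste_like_oversample(rows, 2))
```

Pairing by rank rather than by nearest neighbour in feature space is what separates this family of techniques from SMOTE-style interpolation. The synthetic row inherits a complexity close to that of its parents, so the inspection effort it represents stays realistic.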

Results:

The experimental results on 23 releases of 10 projects show that COSTE greatly improves the diversity of the synthetic instances without compromising the ability of prediction models to find defects. In addition, COSTE outperforms the other oversampling techniques under the same testing effort. The statistical analysis indicates that COSTE's advantage over the other oversampling techniques is significant according to the Wilcoxon rank-sum test, with the effect size measured by Cliff's delta.
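For readers unfamiliar with the effect size mentioned above, Cliff's delta compares every value in one sample with every value in the other: delta = (#(x > y) - #(x < y)) / (n * m). The sketch below is a plain implementation of that definition. The function name and the sample scores are illustrative and are not results from the paper.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta in [-1, 1]; positive values mean xs tends to exceed ys."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


if __name__ == "__main__":
    # Made-up scores for two techniques across three releases.
    technique_a = [3, 4, 5]
    technique_b = [1, 2, 3]
    print(cliffs_delta(technique_a, technique_b))
```

The significance test itself is available in SciPy as `scipy.stats.ranksums`, assuming SciPy is installed. Reporting delta alongside the p-value shows how large the difference is, not just whether it exists.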

Conclusion:

COSTE is recommended as an efficient alternative to address the class imbalance problem in SDP.




Updated: 2020-11-02