Investigation on the stability of SMOTE-based oversampling techniques in software defect prediction
Information and Software Technology (IF 3.9), Pub Date: 2021-06-17, DOI: 10.1016/j.infsof.2021.106662
Shuo Feng, Jacky Keung, Xiao Yu, Yan Xiao, Miao Zhang

Context:

In practice, software datasets tend to contain more non-defective instances than defective ones, which is known as the class imbalance problem in software defect prediction (SDP). The Synthetic Minority Oversampling TEchnique (SMOTE) and its variants alleviate this problem by generating synthetic defective instances, and they have been widely adopted as baselines against which newly proposed oversampling techniques in SDP are compared. However, the SMOTE procedure itself introduces randomness. If the performance of SMOTE-based oversampling techniques is highly unstable, conclusions drawn from comparing them with newly proposed techniques may be misleading and unconvincing.
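
To make the source of the randomness concrete, here is a minimal sketch of SMOTE-style oversampling in plain Python. It is a simplified illustration in the style of common implementations, not code from the paper. The function name `smote`, its parameters, and the toy data are all assumptions made for this example.

```python
import math
import random

def smote(minority, n_new, k=5, seed=None):
    """Simplified SMOTE sketch: three random choices per synthetic instance."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Random choice 1: which minority (defective) instance to start from.
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbors of the base (Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # Random choice 2: which of the k neighbors to interpolate towards.
        neighbor = rng.choice(neighbors)
        # Random choice 3: where on the segment base -> neighbor to land.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, neighbor)))
    return synthetic

# Hypothetical toy data: four defective instances with two features each.
defective = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
run_a = smote(defective, n_new=4, k=2, seed=0)
run_b = smote(defective, n_new=4, k=2, seed=1)
print(run_a == run_b)  # the two runs generally produce different training data
```

Because the synthetic instances change with the seed, a classifier trained on them can score differently from run to run, which is the instability the abstract is concerned with.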

Objective:

This paper investigates the stability of SMOTE-based oversampling techniques and proposes a series of stable SMOTE-based oversampling techniques to improve it.

Method:

Stable SMOTE-based oversampling techniques reduce the randomness in each step of SMOTE-based oversampling by selecting defective instances in turn, selecting the K neighbor instances by distance, and interpolating at evenly distributed points. We also prove mathematically and investigate empirically the stability of SMOTE-based and stable SMOTE-based oversampling techniques on four common classifiers across 26 datasets in terms of AUC, balance, and MCC.
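
The three deterministic steps can be sketched as follows. This is one possible reading of the abstract, not the authors' algorithm: the function name `stable_smote`, the round-robin order, the neighbor cycling rule, and the gap formula are all assumptions chosen for illustration.

```python
import math

def stable_smote(minority, n_new, k=5):
    """Deterministic SMOTE-style sketch: no random number generator is used."""
    n = len(minority)
    visits = math.ceil(n_new / n)  # most times any single base instance is used
    synthetic = []
    for t in range(n_new):
        # Step 1 (assumed): take defective instances in turn, round-robin.
        i, v = t % n, t // n       # v = how often this base was used before
        base = minority[i]
        # Step 2 (assumed): rank neighbors by distance and walk down the
        # ranking on successive visits. sorted() is stable, so ties resolve
        # by input order.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = neighbors[v % len(neighbors)]
        # Step 3 (assumed): evenly spaced interpolation points in (0, 1).
        gap = (v + 1) / (visits + 1)
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, neighbor)))
    return synthetic

# Hypothetical toy data: four defective instances with two features each.
defective = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
print(stable_smote(defective, n_new=4, k=2))
```

With no random draws, repeated calls on the same data return the same synthetic instances, so any remaining run-to-run variation would come from the classifier or the data split rather than from the oversampling step.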

Results:

The analysis shows that stable SMOTE-based oversampling techniques perform both more stably and better than SMOTE-based oversampling techniques. For SMOTE-based oversampling techniques, the difference between the worst and best performance is up to 23.3%, 32.6%, and 204.2% in terms of AUC, balance, and MCC, respectively.
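
For readers unfamiliar with the measures, the sketch below computes balance and MCC from a confusion matrix using their standard definitions, plus a worst-to-best spread. The abstract does not say how the percentage differences were calculated, so `relative_spread` assumes they are relative to the worst score. That function and the example scores are hypothetical.

```python
import math

def balance(tp, fn, fp, tn):
    """Balance: 1 minus the normalized distance to the ideal point (pf=0, pd=1)."""
    pd = tp / (tp + fn)  # probability of detection (recall)
    pf = fp / (fp + tn)  # probability of false alarm
    return 1 - math.sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / math.sqrt(2)

def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def relative_spread(scores):
    """Assumed definition: (best - worst) / |worst| over repeated runs."""
    worst, best = min(scores), max(scores)
    return (best - worst) / abs(worst)

# Hypothetical MCC scores from repeated runs with different random seeds.
runs = [0.2, 0.5, 0.4]
print(f"{relative_spread(runs):.1%}")
```

Under this assumed definition the spread can exceed 100% whenever the worst score is small, which may be why the MCC figure is so much larger than the AUC and balance figures, although the abstract does not confirm this.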

Conclusion:

Stable SMOTE-based oversampling techniques should be considered as a drop-in replacement for SMOTE-based oversampling techniques.




Updated: 2021-06-18