Impact of Feature Selection Methods on the Predictive Performance of Software Defect Prediction Models: An Extensive Empirical Study



Symmetry (IF 2.2), Pub Date: 2020-07-09, DOI: 10.3390/sym12071147
Abdullateef O. Balogun, Shuib Basri, Saipunidzam Mahamad, Said J. Abdulkadir, Malek A. Almomani, Victor E. Adeyemo, Qasem Al-Tashi, Hammed A. Mojeed, Abdullahi A. Imam, Amos O. Bajeh

Feature selection (FS) is a feasible solution for mitigating the high-dimensionality problem, and many FS methods have been proposed in the context of software defect prediction (SDP). However, empirical studies on the impact and effectiveness of FS methods on SDP models often lead to contradictory experimental results and inconsistent findings. These contradictions can be attributed to study limitations such as small datasets, limited FS search methods, and unsuitable prediction models within the scope of the respective studies. It is hence critical to conduct an extensive empirical study that addresses these contradictions, to guide researchers and buttress the scientific tenacity of experimental conclusions. In this study, we investigated the impact of 46 FS methods using Naive Bayes and Decision Tree classifiers over 25 software defect datasets from 4 software repositories (NASA, PROMISE, ReLink, and AEEEM). The resulting prediction models were evaluated based on accuracy and AUC values. The Scott–KnottESD and the novel Double Scott–KnottESD rank statistical methods were used for statistical ranking of the studied FS methods. The experimental results showed that there is no single best FS method, as their respective performances depend on the choice of classifier, performance evaluation metric, and dataset. However, we recommend the use of statistical-based, probability-based, and classifier-based filter feature ranking (FFR) methods, respectively, in SDP. For filter subset selection (FSS) methods, correlation-based feature selection (CFS) with metaheuristic search methods is recommended. For wrapper feature selection (WFS) methods, the IWSS-based WFS method is recommended, as it outperforms the conventional SFS and LHS-based WFS methods.
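
To make the idea of a filter feature ranking (FFR) method concrete, the following is a minimal sketch of one possible statistical-based ranking: each feature is scored by the absolute Pearson correlation between its values and the binary defect label, and the top k features are kept. This is an illustration only, not one of the 46 methods evaluated in the study. The function names, the feature names, and the toy data are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no ranking signal
    return cov / sqrt(var_x * var_y)


def rank_features(rows, labels, names):
    """Return (name, score) pairs sorted by |correlation| with the label, best first."""
    scores = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores.append((name, abs(pearson(column, labels))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def select_top_k(rows, labels, names, k):
    """Keep the names of the k highest-ranked features."""
    return [name for name, _ in rank_features(rows, labels, names)[:k]]


# Hypothetical toy data: one row per software module, label 1 = defective.
FEATURE_NAMES = ["loc", "cyclomatic", "num_comments"]
ROWS = [
    [120, 14, 5],
    [30, 2, 5],
    [200, 25, 5],
    [45, 3, 5],
    [150, 9, 5],
    [25, 4, 5],
]
LABELS = [1, 0, 1, 0, 1, 0]

if __name__ == "__main__":
    for name, score in rank_features(ROWS, LABELS, FEATURE_NAMES):
        print(f"{name}: {score:.3f}")
    print(select_top_k(ROWS, LABELS, FEATURE_NAMES, 2))
```

A filter method like this scores features independently of any classifier, so the reduced feature set can then be passed to a model such as Naive Bayes or a Decision Tree and evaluated on accuracy and AUC. A wrapper method would instead score candidate subsets by training the classifier itself.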


Updated: 2020-07-09
