Evaluating the impact of feature selection consistency in software prediction
Science of Computer Programming (IF 1.5) Pub Date: 2021-09-13, DOI: 10.1016/j.scico.2021.102715
Asad Ali, Carmine Gravino

Many empirical software engineering studies have employed feature selection algorithms to exclude irrelevant and redundant features from datasets, with the aim of improving both the prediction accuracy achieved by machine learning-based estimation models and their generalizability. However, little has been done to investigate how consistently these feature selection algorithms select features/metrics across different training samples, which is an important point for the interpretation of the trained models. Since the interpretation of the models largely depends on the features of the analyzed datasets, it is advisable to evaluate feature selection algorithms in terms of how consistently they extract features from the employed datasets. In this study, we consider eight different feature selection algorithms and evaluate how consistently they select features across the different folds of k-fold cross-validation, as well as when small changes are made to the training data. To provide a stable and generalizable conclusion, we investigate data from two different domains: six datasets from the Software Development Effort Estimation (SDEE) domain and six datasets from the Software Fault Prediction (SFP) domain. Our results reveal that a feature selection algorithm can produce 20-100% inconsistent features on an SDEE dataset and 18.8-95.3% inconsistent features on an SFP dataset. The analysis also reveals that, for the SDEE datasets, the most consistent feature selection algorithm is not necessarily the most accurate one (i.e., the one that leads to better prediction accuracy), whereas for the SFP datasets the most consistent feature selection algorithm is also the most accurate at predicting faults.
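As a concrete illustration, the sketch below (not the authors' code) shows one way fold-to-fold selection consistency could be measured: a feature selection algorithm is run on the training portion of each fold of a k-fold cross-validation, and the overlap between the resulting feature sets is summarized with a pairwise Jaccard similarity. The synthetic dataset, the use of SelectKBest with f_regression, and the Jaccard-based metric are illustrative assumptions; the study itself evaluates eight selection algorithms on real SDEE and SFP datasets.

```python
# Minimal sketch of measuring feature-selection consistency across CV folds.
# All names and parameters here are illustrative assumptions, not the paper's setup.
from itertools import combinations

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import KFold

# Synthetic stand-in for an SDEE/SFP dataset (hypothetical).
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, random_state=0)

k_folds, n_selected = 10, 5
selected_sets = []

for train_idx, _ in KFold(n_splits=k_folds, shuffle=True, random_state=0).split(X):
    # Run the feature selection algorithm on this fold's training data only.
    selector = SelectKBest(score_func=f_regression, k=n_selected)
    selector.fit(X[train_idx], y[train_idx])
    selected_sets.append(frozenset(np.flatnonzero(selector.get_support())))

# Pairwise Jaccard similarity between fold-wise feature sets:
# 1.0 means the same features were always chosen; lower values mean inconsistency.
jaccards = [len(a & b) / len(a | b) for a, b in combinations(selected_sets, 2)]
print(f"mean consistency (Jaccard): {np.mean(jaccards):.2f}")
print(f"inconsistency: {100 * (1 - np.mean(jaccards)):.1f}%")
```

A mean Jaccard of 1.0 would indicate that the same features are selected in every fold; the inconsistency percentages reported in the paper quantify an analogous notion of disagreement across training samples, though the authors' exact metric may differ.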

Updated: 2021-09-21