Improving performance with hybrid feature selection and ensemble machine learning techniques for code smell detection, Science of Computer Programming

Science of Computer Programming (IF 1.3), Pub Date: 2021-08-17, DOI: 10.1016/j.scico.2021.102713
Shivani Jain, Anju Saha

Maintaining large and complex software is a significant task in the IT industry. One reason is the development of code smells, which are design flaws that lead to future bugs and errors. Code smells can be treated with regular refactoring, and their detection is the first step in the software maintenance process. Detecting code smells with machine learning algorithms eliminates the need for extensive knowledge of code smell properties and threshold values. Ensemble machine learning algorithms combine several classifiers, of the same or different types, to further improve performance and reduce variance. In our study, three hybrid feature selection techniques are employed with ensemble machine learning algorithms to improve performance in detecting code smells. Seven machine learning classifiers with different kernel variations, along with three boosting designs, two stacking methods, and bagging, were implemented. For feature selection, combinations of filter-wrapper, filter-embedded, and wrapper-embedded methods were applied. Performance measures for detecting four code smells are evaluated and compared with the performance obtained when feature selection is not employed. After applying hybrid feature selection, the performance measures increased: accuracy by 21.43%, AUC value by 53.24%, and F-measure by 76.06%. Univariate ROC with Lasso is the best hybrid feature selection technique, with 90.48% accuracy and a 94.5% ROC AUC value. Random Forest and Logistic Regression are the best-performing machine learning classifiers. Data Class is the most detectable code smell. Stacking always gave better results than individual classifiers.
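To make the filter-embedded idea concrete, the following is a minimal sketch of a "univariate ROC, then Lasso" selection pipeline using only the Python standard library. It is not the authors' implementation: the metric names, the toy data, the `min_auc=0.7` cutoff, and the `alpha=0.1` penalty are all hypothetical, and fitting a linear Lasso directly to 0/1 labels is a simplifying assumption. The filter stage scores each feature by its stand-alone ROC AUC; the embedded stage keeps only the survivors that receive a non-zero Lasso weight.

```python
from statistics import fmean, pstdev

def roc_auc(scores, labels):
    """ROC AUC by pairwise comparison of positives vs negatives (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def filter_by_univariate_auc(columns, labels, min_auc=0.7):
    """Filter stage: keep features whose stand-alone AUC is far from chance (0.5)."""
    kept = {}
    for name, values in columns.items():
        auc = roc_auc(values, labels)
        if max(auc, 1.0 - auc) >= min_auc:  # inverse relationships count too
            kept[name] = values
    return kept

def standardize(values):
    # Constant columns never reach this point: their AUC is exactly 0.5.
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]

def soft_threshold(z, alpha):
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0

def lasso_weights(columns, labels, alpha=0.1, n_iter=100):
    """Embedded stage: Lasso via coordinate descent on standardized features."""
    names = list(columns)
    cols = [standardize(columns[name]) for name in names]
    y_mean = fmean(labels)
    residual = [y - y_mean for y in labels]  # residual of the all-zero model
    w = [0.0] * len(names)
    n = len(labels)
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(x * (r + w[j] * x) for x, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha)
            residual = [r - (new_w - w[j]) * x for x, r in zip(col, residual)]
            w[j] = new_w
    return dict(zip(names, w))

# Hypothetical class-level metrics for eight classes; 1 = labelled as Data Class.
metrics = {
    "num_accessors": [1, 2, 3, 4, 5, 6, 7, 8],
    "num_fields":    [2, 1, 3, 6, 4, 6, 5, 5],
    "name_length":   [1, 2, 1, 2, 1, 2, 1, 2],
}
is_data_class = [0, 0, 0, 0, 1, 1, 1, 1]

survivors = filter_by_univariate_auc(metrics, is_data_class)
weights = lasso_weights(survivors, is_data_class)
selected = [name for name, w in weights.items() if w != 0.0]
print("after filter:", sorted(survivors))
print("after Lasso: ", selected)
```

On this toy data the filter discards `name_length` (AUC of 0.5), and Lasso then zeroes out `num_fields` because it is largely redundant with `num_accessors`. The selected features would then feed the downstream classifiers or ensembles, which this sketch does not cover.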

Updated: 2021-08-25
