A large empirical assessment of the role of data balancing in machine-learning-based code smell detection, Journal of Systems and Software


Journal of Systems and Software (IF 3.5), Pub Date: 2020-11-01, DOI: 10.1016/j.jss.2020.110693
Fabiano Pecorelli, Dario Di Nucci, Coen De Roover, Andrea De Lucia

Abstract: Code smells can compromise software quality in the long term by inducing technical debt. For this reason, many approaches for identifying these design flaws have been proposed in the last decade. Most of them are based on heuristics in which a set of metrics is used to detect smelly code components. However, these techniques suffer from subjective interpretation, low agreement between detectors, and dependence on thresholds. To overcome these limitations, previous work applied Machine Learning, which can learn from existing datasets without requiring any threshold definition. More recent work, however, has shown that Machine Learning is not always suitable for code smell detection because the problem is highly imbalanced. In this study, we investigate five approaches to mitigating data imbalance in order to understand their impact on Machine Learning-based code smell detection, both in Object-Oriented systems and in systems implementing the Model-View-Controller pattern. Our findings show that avoiding balancing does not dramatically impact accuracy. Existing data balancing techniques are inadequate for code smell detection, leading to poor accuracy for Machine Learning-based approaches. Therefore, new metrics that exploit different software characteristics, and new techniques to combine them effectively, are needed.
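The abstract does not name the five balancing approaches that were studied. As a rough illustration of what "data balancing" means in this setting, the sketch below shows random oversampling, one common generic technique, which may or may not be among those evaluated in the paper. The function name `random_oversample`, the toy metric values, and the label encoding (1 = smelly, 0 = not smelly) are all assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class (illustrative sketch only)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows that belong to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Append random copies until the class reaches the target size.
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy dataset: each row is a pair of made-up code metrics,
# and only one of six components is labelled smelly (1).
rows = [(120, 4), (80, 3), (95, 5), (60, 2), (110, 6), (900, 40)]
labels = [0, 0, 0, 0, 0, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(labels))           # class counts before balancing
print(Counter(balanced_labels))  # class counts after balancing
```

In an evaluation, resampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution.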


Updated: 2020-11-01
