An automatically created novel bug dataset and its validation in bug prediction, Journal of Systems and Software

An automatically created novel bug dataset and its validation in bug prediction
Journal of Systems and Software (IF 3.5), Pub Date: 2020-11-01, DOI: 10.1016/j.jss.2020.110691
Rudolf Ferenc, Péter Gyimesi, Gábor Gyimesi, Zoltán Tóth, Tibor Gyimóthy

Bugs are inescapable during software development because of frequent code changes, tight deadlines, and similar pressures, so tools for finding them are important. One way to identify bugs is to analyze the characteristics of source code elements that were buggy in the past and to predict which present elements are buggy based on the same characteristics, using, for example, machine learning models. To support model building, code elements and their characteristics are collected in so-called bug datasets, which serve as the input for learning. We present the BugHunter Dataset: a novel kind of automatically constructed and freely available bug dataset containing code elements (files, classes, methods) with a wide set of code metrics and bug information. Other available bug datasets follow the traditional approach of gathering the characteristics of all source code elements (buggy and non-buggy) at one or more pre-selected release versions of the code. Our approach, on the other hand, captures the buggy and the fixed states of the same source code elements from the narrowest timeframe we can identify for a bug's presence, regardless of release versions. To show the usefulness of the new dataset, we built and evaluated bug prediction models and achieved F-measure values over 0.74.
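For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F-measure can be computed for a binary bug predictor. It assumes the plain F1 definition (the harmonic mean of precision and recall); the abstract does not say which F-measure variant the authors used, and the function name and the toy labels are hypothetical, not taken from the BugHunter Dataset.

```python
def f_measure(true_labels, predicted_labels):
    """Binary F1 score, where 1 marks a buggy element and 0 a non-buggy one.

    Assumption: plain F1 (harmonic mean of precision and recall); the paper
    may use a different variant, such as a class-weighted average.
    """
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)

    # With no true positives, precision and recall are both zero (or undefined),
    # so the score is reported as 0.0.
    if true_pos == 0:
        return 0.0

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Toy labels for eight code elements (invented for illustration only).
actual = [1, 1, 1, 0, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 1, 1, 0, 0]
score = f_measure(actual, predicted)
print(f"F-measure: {score:.2f}")
```

In this toy case, three of the four buggy elements are found and one clean element is flagged by mistake, so precision and recall are both 0.75 and the F-measure is also 0.75.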

Updated: 2020-11-01
