On the time-based conclusion stability of cross-project defect prediction models
Empirical Software Engineering (IF 4.1), Pub Date: 2020-09-09, DOI: 10.1007/s10664-020-09878-9
Abdul Ali Bangash , Hareem Sahar , Abram Hindle , Karim Ali

Researchers in empirical software engineering often make claims based on observable data such as defect reports. Unfortunately, in many cases, these claims are generalized beyond the data sets that have been evaluated. Will the researchers' conclusions hold a year from now for the same software projects? Perhaps not. Recent studies show that, in the area of software analytics, conclusions over different data sets are usually inconsistent. In this article, we empirically investigate whether conclusions in the area of defect prediction truly exhibit stability over time. Our investigation applies a time-aware evaluation approach in which models are trained only on the past and evaluated only on the future. Through this time-aware evaluation, we show that, depending on the time period in which we evaluate defect predictors, their performance, in terms of F-score, area under the curve (AUC), and Matthews correlation coefficient (MCC), varies, and their results are inconsistent. The next release of a product, if it differs significantly from its prior release, may drastically change defect prediction performance. Therefore, without knowing about conclusion stability, empirical software engineering researchers should limit their claims of performance to the contexts of evaluation, because broad claims about defect prediction performance might be contradicted by the next upcoming release of a product under analysis.
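The time-aware evaluation described above can be sketched in a few lines: split labeled defect data chronologically at a cutoff date (train strictly on the past, test on the future) and score the resulting predictions with MCC. This is a minimal illustration, not the paper's actual pipeline; the sample-record shape (`date`, `buggy` keys) and the cutoff are hypothetical.

```python
from datetime import date

def time_aware_split(samples, cutoff):
    """Chronological split: records dated before the cutoff form the
    training set, records on or after it form the test set, so the
    model never sees the future it is evaluated on."""
    train = [s for s in samples if s["date"] < cutoff]
    test = [s for s in samples if s["date"] >= cutoff]
    return train, test

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0.0 when any marginal is empty (undefined denominator)."""
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical usage: six monthly snapshots, split at April 2019.
samples = [{"date": date(2019, m, 1), "buggy": m % 2 == 0} for m in range(1, 7)]
past, future = time_aware_split(samples, date(2019, 4, 1))
```

Re-running such a split with different cutoffs is what exposes the instability the paper reports: the same predictor can score very differently depending on which release period lands in the test set.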

Updated: 2020-09-09