The impact of using biased performance metrics on software defect prediction research, Information and Software Technology

Information and Software Technology (IF 3.8), Pub Date: 2021-06-17, DOI: 10.1016/j.infsof.2021.106664
Jingxiu Yao, Martin Shepperd

Context:

Software engineering researchers have undertaken many experiments investigating the potential of software defect prediction algorithms. Unfortunately, some performance metrics are known to be problematic, most notably F1, yet F1 remains widely used.

Objective:

To investigate the potential impact of using F1 on the validity of this large body of research.

Method:

We undertook a systematic review to locate relevant experiments and then extracted all pairwise comparisons of defect prediction performance using F1 and the unbiased Matthews correlation coefficient (MCC).
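
As an illustration of what a pairwise comparison under the two metrics looks like, the following minimal sketch computes F1 and MCC from confusion-matrix counts and checks whether they agree on which of two classifiers is better. The function names and the counts are hypothetical, chosen only for illustration; they are not taken from the paper or its primary studies.

```python
import math

def f1_score(tp, fp, fn, tn):
    """F1 = 2TP / (2TP + FP + FN). The tn argument is accepted but never used."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def direction(a, b):
    """Sign of a - b: +1 if a is larger, -1 if b is larger, 0 for a tie."""
    return (a > b) - (a < b)

# Hypothetical (tp, fp, fn, tn) for two classifiers on the same data set
# (80 defect-prone units, 20 not defect-prone units in both cases).
clf_a = (78, 18, 2, 2)    # labels almost everything as defect-prone
clf_b = (60, 4, 20, 16)   # more balanced errors across both classes

f1_dir = direction(f1_score(*clf_a), f1_score(*clf_b))
mcc_dir = direction(mcc(*clf_a), mcc(*clf_b))

print(f"F1:  A={f1_score(*clf_a):.3f}  B={f1_score(*clf_b):.3f}")
print(f"MCC: A={mcc(*clf_a):.3f}  B={mcc(*clf_b):.3f}")
print("direction changed" if f1_dir != mcc_dir else "same direction")
```

In this invented pair, F1 ranks classifier A above B while MCC ranks B above A, which is the kind of disagreement counted as a change of direction.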

Results:

We found a total of 38 primary studies containing 12,471 pairs of results. Of these comparisons, 21.95% changed direction when MCC was used instead of the biased F1 metric. Unfortunately, we also found evidence suggesting that F1 remains widely used in software defect prediction research.

Conclusion:

We reiterate the concerns of statisticians that F1 is a problematic metric outside of an information retrieval context, since in defect prediction we are concerned about both classes (defect-prone and not defect-prone units). This inappropriate usage has led to a substantial number (more than one fifth) of results that are erroneous in terms of direction. Therefore, we urge researchers to (i) use an unbiased metric and (ii) publish detailed results, including confusion matrices, so that alternative analyses become possible.
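
To make the two-class concern concrete, here is a small sketch of how a published confusion matrix allows other metrics to be recomputed, and of how F1 stays fixed while the number of true negatives varies. The helper name `metrics_from_confusion_matrix` and all counts are assumptions made for illustration, not values from the paper.

```python
import math

def metrics_from_confusion_matrix(tp, fp, fn, tn):
    """Derive several metrics from one confusion matrix of non-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0  # tn does not appear here
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical counts: hold tp, fp, fn fixed and vary only the true negatives.
tn_values = (0, 40, 4000)
results = [metrics_from_confusion_matrix(40, 10, 10, tn) for tn in tn_values]

for tn, m in zip(tn_values, results):
    print(f"tn={tn:>4}  F1={m['f1']:.3f}  MCC={m['mcc']:.3f}")
```

F1 is identical in all three rows because true negatives never enter its formula, whereas MCC moves from negative to strongly positive as the classifier correctly identifies more units that are not defect-prone.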

Updated: 2021-06-25
