Is p-value 0.05 enough? A study on the statistical evaluation of classifiers, The Knowledge Engineering Review



Is p-value 0.05 enough? A study on the statistical evaluation of classifiers
The Knowledge Engineering Review (IF 2.1), Pub Date: 2020-11-27, DOI: 10.1017/s0269888920000417
Nadine M. Neumann, Alexandre Plastino, Jony A. Pinto Junior, Alex A. Freitas

Statistical significance analysis based on hypothesis tests is a common approach for comparing classifiers. However, many studies oversimplify this analysis by checking only the condition p-value < 0.05, ignoring important concepts such as the effect size and the statistical power of the test. The problem is serious enough that the American Statistical Association has taken a strong stand on the subject, noting that although the p-value is a useful statistical measure, it has been misused and misinterpreted. This work highlights problems caused by the misuse of hypothesis tests and shows how the effect size and the power of the test can provide important information for better decision-making. To investigate these issues, we perform empirical studies with different classifiers and 50 datasets, using Student's t-test and the Wilcoxon test to compare classifiers. The results show that an isolated p-value analysis can lead to wrong conclusions and that evaluating the effect size and the power of the test contributes to more principled decision-making.
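To make the distinction between a p-value, an effect size, and the power of a test concrete, the following is a minimal sketch and not the authors' procedure. It assumes two classifiers evaluated on the same datasets and uses a large-sample normal approximation to the paired t-test, so that only the Python standard library is needed. The function name `paired_summary`, the choice of the standardised paired difference (often called Cohen's d_z) as the effect size, and the synthetic accuracies are all illustrative assumptions. The power is computed at the observed effect size, which is only one of several possible conventions.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def paired_summary(acc_a, acc_b, alpha=0.05):
    """Summarise a paired comparison of two classifiers on the same datasets.

    Hypothetical helper: uses a normal approximation to the paired t-test,
    which is reasonable only when the number of datasets is fairly large.
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")

    # Standardised paired effect size (Cohen's d_z): mean difference in SD units.
    d_z = mean(diffs) / sd

    # Test statistic of the paired test, treated here as approximately N(0, 1).
    z = d_z * math.sqrt(n)
    std_normal = NormalDist()
    p_value = 2 * (1 - std_normal.cdf(abs(z)))

    # Approximate two-sided power, assuming the true effect equals the observed d_z.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = abs(d_z) * math.sqrt(n)
    power = (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)

    return {
        "n": n,
        "mean_diff": mean(diffs),
        "effect_size": d_z,
        "p_value": p_value,
        "power": power,
    }


if __name__ == "__main__":
    # Synthetic accuracies for 50 datasets (NOT the paper's data): classifier B
    # is on average 0.2 percentage points worse than A, with small noise.
    rng = random.Random(0)
    acc_a = [rng.uniform(0.70, 0.95) for _ in range(50)]
    acc_b = [a - 0.002 + rng.gauss(0, 0.004) for a in acc_a]
    for key, value in paired_summary(acc_a, acc_b).items():
        print(f"{key}: {value:.4f}" if isinstance(value, float) else f"{key}: {value}")
```

Reporting the raw mean difference next to the p-value lets a reader judge whether a statistically significant result is also large enough to matter in practice, and the power estimate indicates how much weight a non-significant result can bear.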


Updated: 2020-11-27
