Our Evaluation Metric Needs an Update to Encourage Generalization

arXiv - CS - Artificial Intelligence, Pub Date: 2020-07-14, DOI: arxiv-2007.06898
Swaroop Mishra, Anjana Arunkumar, Chris Bryan and Chitta Baral

Models that surpass human performance on several popular benchmarks degrade significantly when exposed to out-of-distribution (OOD) data. Recent research has shown that models overfit to spurious biases and 'hack' datasets instead of learning generalizable features as humans do. To stop this inflation of model performance, and the resulting overestimation of AI systems' capabilities, we propose a simple and novel evaluation metric, WOOD Score, that encourages generalization during evaluation.

Updated: 2020-07-15
