How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation,arXiv - CS - Computation and Language


arXiv - CS - Computation and Language. Pub Date: 2021-06-10, DOI: arxiv-2106.05532
Swaroop Mishra, Anjana Arunkumar

Models that top leaderboards often perform unsatisfactorily when deployed in real world applications; this has necessitated rigorous and expensive pre-deployment model testing. A hitherto unexplored facet of model performance is: Are our leaderboards doing equitable evaluation? In this paper, we introduce a task-agnostic method to probe leaderboards by weighting samples based on their 'difficulty' level. We find that leaderboards can be adversarially attacked and top performing models may not always be the best models. We subsequently propose alternate evaluation metrics. Our experiments on 10 models show changes in model ranking and an overall reduction in previously reported performance, thus rectifying the overestimation of AI systems' capabilities. Inspired by behavioral testing principles, we further develop a prototype of a visual analytics tool that enables leaderboard revamping through customization, based on an end user's focus area. This helps users analyze models' strengths and weaknesses, and guides them in the selection of a model best suited for their application scenario. In a user study, members of various commercial product development teams, covering 5 focus areas, find that our prototype reduces pre-deployment development and testing effort by 41% on average.
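
The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea of weighting samples by difficulty. It assumes each sample already has a difficulty score that serves directly as its weight. The function names, the toy data, and the weighting scheme are hypothetical illustrations, not the authors' method or metrics.

```python
from typing import Dict, List, Sequence


def weighted_accuracy(correct: Sequence[bool], difficulty: Sequence[float]) -> float:
    """Accuracy in which each sample counts in proportion to its difficulty weight.

    Assumption: the difficulty score is used directly as the sample weight.
    """
    if len(correct) != len(difficulty):
        raise ValueError("correct and difficulty must have the same length")
    total = sum(difficulty)
    if total <= 0:
        raise ValueError("difficulty weights must sum to a positive value")
    return sum(w for c, w in zip(correct, difficulty) if c) / total


def rank_models(results: Dict[str, Sequence[bool]], difficulty: Sequence[float]) -> List[str]:
    """Return model names sorted from highest to lowest weighted accuracy."""
    scores = {name: weighted_accuracy(c, difficulty) for name, c in results.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: four samples, the last one much harder than the rest.
difficulty = [0.1, 0.1, 0.1, 0.9]
results = {
    "model_a": [True, True, True, False],   # solves the easy samples only
    "model_b": [False, False, True, True],  # solves the hard sample
}

# Uniform weights reproduce ordinary accuracy and the ordinary ranking.
uniform = [1.0] * len(difficulty)
plain_ranking = rank_models(results, uniform)
weighted_ranking = rank_models(results, difficulty)

print("plain:", plain_ranking)
print("difficulty-weighted:", weighted_ranking)
```

In this toy case the two rankings disagree: `model_a` leads on plain accuracy, while `model_b` leads once the hard sample carries more weight. This mirrors the kind of ranking change the abstract reports, though the paper's actual metrics may be defined differently.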


Updated: 2021-06-11
