Significant Improvements over the State of the Art? A Case Study of the MS MARCO Document Ranking Leaderboard
arXiv - CS - Information Retrieval. Pub Date: 2021-02-25, DOI: arxiv-2102.12887
Jimmy Lin, Daniel Campos, Nick Craswell, Bhaskar Mitra, Emine Yilmaz

Leaderboards are a ubiquitous part of modern research in applied machine learning. By design, they sort entries into some linear order, where the top-scoring entry is recognized as the "state of the art" (SOTA). Due to the rapid progress being made in information retrieval today, particularly with neural models, the top entry in a leaderboard is replaced with some regularity. These are touted as improvements in the state of the art. Such pronouncements, however, are almost never qualified with significance testing. In the context of the MS MARCO document ranking leaderboard, we pose a specific question: How do we know if a run is significantly better than the current SOTA? We ask this question against the backdrop of recent IR debates on scale types: in particular, whether commonly used significance tests are even mathematically permissible. Recognizing these potential pitfalls in evaluation methodology, our study proposes an evaluation framework that explicitly treats certain outcomes as distinct and avoids aggregating them into a single-point metric. Empirical analysis of SOTA runs from the MS MARCO document ranking leaderboard reveals insights about how one run can be "significantly better" than another that are obscured by the current official evaluation metric (MRR@100).

Updated: 2021-02-26