Towards Meaningful Statements in IR Evaluation. Mapping Evaluation Measures to Interval Scales
arXiv - CS - Information Retrieval Pub Date : 2021-01-07 , DOI: arxiv-2101.02668
Marco Ferrante, Nicola Ferro, Norbert Fuhr

Recently, it was shown that most popular IR measures are not interval-scaled, implying that decades of experimental IR research used potentially improper methods, which may have produced questionable results. However, it was unclear if and to what extent these findings apply to actual evaluations and this opened a debate in the community with researchers standing on opposite positions about whether this should be considered an issue (or not) and to what extent. In this paper, we first give an introduction to the representational measurement theory explaining why certain operations and significance tests are permissible only with scales of a certain level. For that, we introduce the notion of meaningfulness specifying the conditions under which the truth (or falsity) of a statement is invariant under permissible transformations of a scale. Furthermore, we show how the recall base and the length of the run may make comparison and aggregation across topics problematic. Then we propose a straightforward and powerful approach for turning an evaluation measure into an interval scale, and describe an experimental evaluation of the differences between using the original measures and the interval-scaled ones. For all the regarded measures - namely Precision, Recall, Average Precision, (Normalized) Discounted Cumulative Gain, Rank-Biased Precision and Reciprocal Rank - we observe substantial effects, both on the order of average values and on the outcome of significance tests. For the latter, previously significant differences turn out to be insignificant, while insignificant ones become significant. The effect varies remarkably between the tests considered but overall, on average, we observed a 25% change in the decision about which systems are significantly different and which are not.
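
The notion of meaningfulness described above can be illustrated with a small sketch. It is not taken from the paper: the per-topic scores, the function name `mean_order_holds`, and the three transformations are hypothetical and chosen only for illustration. The sketch checks whether the statement "system A has a higher mean score than system B" keeps its truth value when the scores are transformed. A positive affine transformation (permissible for interval scales) preserves the statement, while a strictly increasing but non-linear transformation (permissible only for ordinal scales) can reverse it.

```python
from statistics import mean


def mean_order_holds(a, b, transform):
    """Return True if the mean of transformed a exceeds the mean of transformed b."""
    return mean([transform(x) for x in a]) > mean([transform(x) for x in b])


def identity(x):
    return x


def affine(x):
    # Positive affine map: a permissible transformation of an interval scale.
    return 3.0 * x + 2.0


def monotone(x):
    # Strictly increasing but non-linear: permissible only for an ordinal scale.
    return x ** 0.25


# Hypothetical per-topic scores for two systems (illustrative values only).
system_a = [0.1, 0.9]
system_b = [0.4, 0.5]

for name, transform in [("identity", identity), ("affine", affine), ("monotone", monotone)]:
    print(name, mean_order_holds(system_a, system_b, transform))
```

With these toy values the comparison of means holds under the identity and affine maps but flips under the monotone map, so a statement about means is meaningful only if the underlying scores form at least an interval scale.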

Updated: 2021-01-08