On the effect of relevance scales in crowdsourcing relevance assessments for Information Retrieval evaluation, Information Processing & Management



Information Processing & Management (IF 7.4), Pub Date: 2021-07-28, DOI: 10.1016/j.ipm.2021.102688
Kevin Roitero ₁ , Eddy Maddalena ₂ , Stefano Mizzaro ₁ , Falk Scholer ₃


Relevance is a key concept in information retrieval and is widely used for the evaluation of search systems using test collections. We present a comprehensive study of the effect of the choice of relevance scales on the evaluation of information retrieval systems. Our work analyzes and compares four crowdsourced scales (2-level, 4-level, and 100-level ordinal scales, and a magnitude estimation scale) and two expert-labeled datasets (on 2- and 4-level ordinal scales). We compare the scales considering internal and external agreement and the effect on IR evaluation in terms of both system effectiveness and topic ease, and we discuss the effect of such scales and datasets on the perception of relevance levels by assessors.

Our analyses show that: crowdsourced judgment distributions are consistent across scales, both overall and at the per-topic level; on all scales crowdsourced judgments agree with the expert judgments, and overall the crowd assessors are able to express reliable relevance judgments; all scales lead to a similar level of external agreement with the ground truth, while the internal agreement among crowd workers is higher for fine-grained scales; more fine-grained scales consistently lead to higher correlation values for both system ranking and topic ease; finally, we found that the considered scales lead to different perceived distances between relevance levels.
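As an illustration of what "correlation values for system ranking" can mean in practice, the sketch below compares two rankings of the same systems, one induced by expert judgments and one by crowd judgments. The abstract does not name the correlation measure used, so Kendall's tau (a common choice in IR evaluation) is an assumption here, and the system scores are made-up numbers, not results from the paper.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical effectiveness scores for five systems (illustrative only).
expert_scores = [0.42, 0.35, 0.51, 0.28, 0.47]  # from expert labels
crowd_scores = [0.40, 0.37, 0.49, 0.30, 0.39]   # from crowd labels

tau = kendall_tau(expert_scores, crowd_scores)
print(f"Kendall's tau between system rankings: {tau:.2f}")
```

A value close to 1 means the crowd-based scores order the systems almost exactly as the expert-based scores do. The same function could be applied to per-topic scores to compare topic ease.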


Updated: 2021-07-29
