Retrieval Evaluation Measures that Agree with Users’ SERP Preferences
ACM Transactions on Information Systems (IF 5.4), Pub Date: 2020-12-31, DOI: 10.1145/3431813
Tetsuya Sakai, Zhaohao Zeng

We examine the “goodness” of ranked retrieval evaluation measures in terms of how well they align with users’ Search Engine Result Page (SERP) preferences for web search. The SERP preferences cover 1,127 topic-SERP-SERP triplets extracted from the NTCIR-9 INTENT task, reflecting the views of 15 different assessors. Each assessor made two SERP preference judgements for each triplet: one in terms of relevance and the other in terms of diversity. For each evaluation measure, we compute the Agreement Rate (AR) of each triplet: the proportion of assessors that agree with the measure’s SERP preference. We then compare the mean ARs of the measures as well as those of best/median/worst assessors using Tukey HSD tests. Our first experiment compares traditional ranked retrieval measures based on the SERP relevance preferences: we find that normalised Discounted Cumulative Gain (nDCG) and intentwise Rank-biased Utility (iRBU) perform best in that they are the only measures that are statistically indistinguishable from our best assessor; nDCG also statistically significantly outperforms our median assessor. Our second experiment utilises 119,646 document preferences that we collected for a subset of the above topic-SERP-SERP triplets (containing 894 triplets) to compare preference-based evaluation measures as well as traditional ones. Again, we evaluate them based on the SERP relevance preferences. The results suggest that measures such as wpref5 are the most promising among the preference-based measures considered, although they underperform the best traditional measures such as nDCG on average. Our third experiment compares diversified search measures based on the SERP diversity preferences as well as the SERP relevance preferences, and it shows that D♯-measures are clearly the most reliable: in particular, D♯-nDCG and D♯-RBP statistically significantly outperform the median assessor and all intent-aware measures; they also outperform the recently proposed RBU on average. 
Also, in terms of agreement with SERP diversity preferences, D♯-nDCG statistically significantly outperforms RBU. Hence, if IR researchers want to use evaluation measures that align well with users’ SERP preferences, then we recommend nDCG and iRBU for traditional search, and D♯-measures such as D♯-nDCG for diversified search. As for the document preference-based measures that we have examined, we do not have a strong reason to recommend them over traditional measures like nDCG, since they align slightly less well with users’ SERP preferences despite their quadratic assessment cost.
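
The Agreement Rate described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy scores, and the toy assessor labels are all hypothetical, and the handling of tied measure scores (counted as agreeing with no assessor) is an assumption that the abstract does not specify. The Tukey HSD comparison of mean ARs is not shown.

```python
from statistics import mean


def measure_preference(score_a, score_b):
    """Return 'A' or 'B' for the SERP the measure scores higher, or None on a tie."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return None


def agreement_rate(score_a, score_b, assessor_prefs):
    """Proportion of assessors whose SERP preference matches the measure's."""
    pref = measure_preference(score_a, score_b)
    if pref is None:
        # Assumption (not from the paper): a tied measure agrees with nobody.
        return 0.0
    return sum(p == pref for p in assessor_prefs) / len(assessor_prefs)


# Toy topic-SERP-SERP triplets: (measure score for SERP A, score for SERP B,
# per-assessor preferences). All values are made up for illustration.
triplets = [
    (0.8, 0.5, ["A", "A", "B"]),
    (0.3, 0.6, ["B", "B", "B"]),
    (0.4, 0.4, ["A", "B", "A"]),
]

ars = [agreement_rate(a, b, prefs) for a, b, prefs in triplets]
mean_ar = mean(ars)  # the per-measure mean AR that would then be compared
print(ars, mean_ar)
```

In this sketch, a measure that sides with the majority of assessors on most triplets ends up with a higher mean AR, which is the quantity the abstract compares across measures and against individual assessors.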

Updated: 2020-12-31