Meta-evaluation of Conversational Search Evaluation Metrics
arXiv - CS - Information Retrieval. Pub Date: 2021-04-27, arXiv:2104.13453
Zeyang Liu, Ke Zhou, Max L. Wilson

Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems in multiple rounds through natural language dialogues. Evaluating such systems is very challenging given that any natural-language response could be generated, and users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies proposed many evaluation metrics, the extent to which those measures effectively capture user preference remains to be investigated. In this paper, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect "actual" performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture properties deemed important in the context of conversational search: adequacy, informativeness, and fluency. By conducting experiments on two test collections, we find that the performance of different metrics varies significantly across different scenarios, whereas, consistent with prior studies, existing metrics achieve only a weak correlation with ultimate user preference and satisfaction. Considering all three perspectives, METEOR is, comparatively speaking, the best existing single-turn metric. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation of conversational search to date.
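As a concrete illustration of the kind of metric being meta-evaluated, the sketch below scores hypothetical single-turn responses with NLTK's METEOR implementation and then checks fidelity in the abstract's sense as rank concordance (Kendall's tau) between metric scores and user ratings. This is not the paper's code: the example responses, the 1-5 ratings, and the use of nltk and scipy are assumptions made purely for illustration.

```python
# A minimal, illustrative sketch (not the paper's code): score responses
# with METEOR and check fidelity as rank correlation with user preference.
# Assumes a recent NLTK (pre-tokenized inputs) and SciPy; all data is made up.
import nltk
from nltk.translate.meteor_score import meteor_score
from scipy.stats import kendalltau

nltk.download("wordnet", quiet=True)  # METEOR's synonym matching needs WordNet

# Hypothetical single-turn examples: (reference answer, system response,
# user preference rating on a 1-5 scale).
examples = [
    ("the flight departs at 9 am from gate 12",
     "your flight leaves at 9 am from gate 12", 5),
    ("the flight departs at 9 am from gate 12",
     "flights are a common mode of transport", 1),
    ("press and hold the power button for ten seconds",
     "hold the power button down for about ten seconds", 4),
]

scores, ratings = [], []
for reference, response, rating in examples:
    # METEOR aligns unigrams (exact, stem, and synonym matches) and combines
    # precision with recall, penalizing fragmented alignments.
    scores.append(meteor_score([reference.split()], response.split()))
    ratings.append(rating)

# Fidelity: does the metric order responses the same way users do?
# Kendall's tau is one standard concordance measure for this question.
tau, p_value = kendalltau(scores, ratings)
print(f"METEOR scores: {[round(s, 3) for s in scores]}")
print(f"Kendall tau vs. user ratings: {tau:.3f} (p={p_value:.3f})")
```

A real meta-evaluation would of course use many system runs and human judgments rather than three toy pairs, but the pipeline shape (per-response metric scores, then correlation against user preference) is the same.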

Updated: 2021-04-29