PONE
ACM Transactions on Information Systems ( IF 5.6 ) Pub Date : 2020-11-24 , DOI: 10.1145/3423168
Tian Lan 1 , Xian-Ling Mao 1 , Wei Wei 2 , Xiaoyan Gao 1 , Heyan Huang 1

Open-domain generative dialogue systems have attracted considerable attention over the past few years, but evaluating them automatically remains a major challenge. As far as we know, there are three kinds of automatic evaluation metrics for open-domain generative dialogue systems: (1) word-overlap-based metrics; (2) embedding-based metrics; (3) learning-based metrics. Because no systematic comparison exists, it is unclear which kind of metric is most effective. In this article, we first systematically compare all three kinds of metrics to determine which is best. Extensive experiments demonstrate that learning-based metrics are the most effective evaluation metrics for open-domain generative dialogue systems. Moreover, we observe that nearly all learning-based metrics depend on a negative sampling mechanism, which yields extremely imbalanced and low-quality samples for training a score model. To address this issue, we propose PONE, a novel learning-based metric that significantly improves the correlation with human judgments by using augmented POsitive samples and valuable NEgative samples. Extensive experiments demonstrate that PONE significantly outperforms the state-of-the-art learning-based evaluation method. We have also publicly released the code for our proposed metric and the state-of-the-art baselines.
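As a rough illustration of the negative sampling mechanism that the abstract says most learning-based metrics rely on, the sketch below builds training pairs for a hypothetical score model. It is not the PONE method or the authors' released code: the function name `build_training_pairs`, the `num_negatives` parameter, and the toy dialogues are all assumptions made for this example. It only shows how pairing each context with randomly drawn responses produces more negatives than positives, with no control over how informative those negatives are.

```python
import random


def build_training_pairs(dialogues, num_negatives=3, seed=0):
    """Hypothetical helper: label each ground-truth (context, response) pair 1,
    then pair the same context with randomly drawn responses from other
    dialogues and label those pairs 0."""
    rng = random.Random(seed)
    samples = []
    for i, (context, response) in enumerate(dialogues):
        samples.append((context, response, 1))
        # Candidate negatives: the responses of every other dialogue.
        others = [r for j, (_, r) in enumerate(dialogues) if j != i]
        for negative in rng.sample(others, min(num_negatives, len(others))):
            samples.append((context, negative, 0))
    return samples


# Toy data invented for this sketch; not taken from any real corpus.
toy_dialogues = [
    ("How was your weekend?", "Great, I went hiking."),
    ("What do you do for work?", "I'm a nurse."),
    ("Do you like coffee?", "Yes, I drink it every morning."),
    ("Where are you from?", "I grew up in Beijing."),
    ("Any plans tonight?", "Just cooking dinner at home."),
]

pairs = build_training_pairs(toy_dialogues, num_negatives=3)
positives = sum(label for _, _, label in pairs)
negatives = len(pairs) - positives
print(positives, negatives)  # 5 positives, 15 negatives
```

With three random negatives per context, the toy set ends up with a 1:3 positive-to-negative ratio, and the negatives are chosen without regard to how close they are to a plausible reply. The abstract describes PONE as addressing this with augmented positive samples and valuable negative samples; how those samples are produced is not specified here.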

Updated: 2020-11-24