Shades of BLEU, Flavours of Success: The Case of MultiWOZ, arXiv - CS - Computation and Language

arXiv - CS - Computation and Language. Pub Date: 2021-06-10, DOI: arxiv-2106.05555
Tomáš Nekvinda, Ondřej Dušek

The MultiWOZ dataset (Budzianowski et al., 2018) is frequently used for benchmarking the context-to-response abilities of task-oriented dialogue systems. In this work, we identify inconsistencies in data preprocessing and in the reporting of three corpus-based metrics used on this dataset, i.e., BLEU score and Inform & Success rates. We point out a few problems of the MultiWOZ benchmark, such as unsatisfactory preprocessing, insufficient or under-specified evaluation metrics, and a rigid database. We re-evaluate 7 end-to-end and 6 policy optimization models in as-fair-as-possible setups, and we show that their reported scores cannot be directly compared. To facilitate comparison of future systems, we release our stand-alone standardized evaluation scripts. We also give basic recommendations for corpus-based benchmarking in future work.

Updated: 2021-06-11
