Rethinking Supervised Learning and Reinforcement Learning in Task-Oriented Dialogue Systems
arXiv - CS - Computation and Language, Pub Date: 2020-09-21, DOI: arxiv-2009.09781
Ziming Li and Julia Kiseleva and Maarten de Rijke

Dialogue policy learning for task-oriented dialogue systems has recently enjoyed great progress, mostly through the use of reinforcement learning methods. However, these approaches have become very sophisticated, and it is time to re-evaluate them. Are we really making progress by developing dialogue agents based only on reinforcement learning? We demonstrate how (1) traditional supervised learning together with (2) a simulator-free adversarial learning method can be used to achieve performance comparable to state-of-the-art RL-based methods. First, we introduce a simple dialogue action decoder to predict the appropriate actions. Then, the traditional multi-label classification solution for dialogue policy learning is extended by adding dense layers to improve dialogue agent performance. Finally, we employ the Gumbel-Softmax estimator to alternately train the dialogue agent and the dialogue reward model without using reinforcement learning. Based on our extensive experimentation, we conclude that the proposed methods achieve more stable and higher performance with less effort, for instance without the domain knowledge required to design a user simulator or the intractable parameter tuning of reinforcement learning. Our main goal is not to beat reinforcement learning with supervised learning, but to demonstrate the value of rethinking the roles of reinforcement learning and supervised learning in optimizing task-oriented dialogue systems.
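A minimal PyTorch sketch of the two ideas mentioned in the abstract: a multi-label dialogue-action classifier extended with dense layers, and Gumbel-Softmax sampling so that a reward (discriminator) model can be trained alternately with the policy without reinforcement learning. This is not the authors' released code; the module names, hidden sizes, the two-way (on/off) Gumbel-Softmax treatment of each action slot, and the loss weighting are all illustrative assumptions.

# Minimal sketch: multi-label action decoder + Gumbel-Softmax adversarial training.
# All dimensions and names are illustrative, not taken from the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionDecoder(nn.Module):
    """Dialogue policy as multi-label classification: state vector -> one logit per dialogue action."""
    def __init__(self, state_dim=512, hidden_dim=256, num_actions=100):
        super().__init__()
        # Extra dense layers on top of a plain linear classifier, as described in the abstract.
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, state):
        return self.net(state)  # raw logits, one per candidate action

class RewardModel(nn.Module):
    """Scores a (state, action-vector) pair; trained to separate human actions from policy samples."""
    def __init__(self, state_dim=512, num_actions=100, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_actions, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state, actions):
        return self.net(torch.cat([state, actions], dim=-1))

policy, reward = ActionDecoder(), RewardModel()
opt_p = torch.optim.Adam(policy.parameters(), lr=1e-4)
opt_r = torch.optim.Adam(reward.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(state, gold_actions):
    """state: [batch, state_dim]; gold_actions: [batch, num_actions] multi-hot human labels."""
    logits = policy(state)
    sl_loss = bce(logits, gold_actions)  # plain supervised multi-label loss

    # Gumbel-Softmax gives a differentiable, approximately one-hot on/off decision per action,
    # so the reward model's gradient flows back into the policy -- no policy-gradient RL needed.
    two_way = torch.stack([logits, -logits], dim=-1)              # [batch, num_actions, 2]
    sampled = F.gumbel_softmax(two_way, tau=1.0, hard=True)[..., 0]

    # Reward-model (discriminator) update: human actions scored high, sampled actions low.
    real, fake = reward(state, gold_actions), reward(state, sampled.detach())
    r_loss = bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))
    opt_r.zero_grad(); r_loss.backward(); opt_r.step()

    # Policy update: supervised loss plus the differentiable reward signal on its own samples.
    g_loss = sl_loss - reward(state, sampled).mean()
    opt_p.zero_grad(); g_loss.backward(); opt_p.step()
    return sl_loss.item(), r_loss.item(), g_loss.item()

The straight-through (hard=True) sample plays the role of the agent's discrete action while still letting gradients from the reward model reach the policy, which is what removes the need for a user simulator and policy-gradient updates in this kind of setup.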

Updated: 2020-09-22