Guided Dialog Policy Learning without Adversarial Learning in the Loop
arXiv - CS - Artificial Intelligence. Pub Date: 2020-04-07, DOI: arxiv-2004.03267
Ziming Li, Sungjin Lee, Baolin Peng, Jinchao Li, Julia Kiseleva, Maarten de Rijke, Shahin Shayandeh, Jianfeng Gao

Reinforcement Learning (RL) methods have emerged as a popular choice for training efficient and effective dialogue policies. However, these methods suffer from sparse and unstable reward signals that a user simulator returns only when a dialogue finishes. Moreover, the reward signal is manually designed by human experts, which requires domain knowledge. Recently, a number of adversarial learning methods have been proposed to learn the reward function together with the dialogue policy. However, alternately updating the dialogue policy and the reward model on the fly restricts us to policy-gradient-based algorithms, such as REINFORCE and PPO. In addition, the alternating training of a dialogue agent and the reward model can easily get stuck in local optima or result in mode collapse. To overcome these issues, we propose to decompose the adversarial training into two steps. First, we train the discriminator with an auxiliary dialogue generator; then we incorporate the derived reward model into a common RL method to guide dialogue policy learning. This approach is applicable to both on-policy and off-policy RL methods. Based on extensive experimentation, we conclude that the proposed method: (1) achieves a remarkable task success rate with both on-policy and off-policy RL methods; and (2) has the potential to transfer knowledge from existing domains to a new domain.

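The two-step recipe described in the abstract lends itself to a compact illustration: train a discriminator offline against an auxiliary dialogue generator, then freeze it and use its score as a dense per-turn reward inside any standard RL algorithm. The sketch below shows only that general idea and is not the authors' implementation; the module names, dimensions, placeholder data batches, and the log-ratio reward shaping are assumptions chosen for illustration.

```python
# Minimal sketch (PyTorch) of two-step reward learning: (1) train a discriminator
# against an auxiliary generator offline; (2) use the frozen discriminator as a
# per-turn reward model for any RL algorithm. All sizes and data are hypothetical.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 128, 32  # hypothetical dialogue state/action sizes


class RewardDiscriminator(nn.Module):
    """Scores a (state, action) pair; higher means 'looks like human dialogue'."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def train_discriminator(disc, human_batches, generator_batches, epochs=10):
    """Step 1: train the discriminator offline against an auxiliary generator,
    instead of updating it inside the RL loop."""
    opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
    bce = nn.BCELoss()
    for _ in range(epochs):
        for (h_s, h_a), (g_s, g_a) in zip(human_batches, generator_batches):
            real = disc(h_s, h_a)           # human (state, action) pairs
            fake = disc(g_s, g_a)           # pairs produced by the auxiliary generator
            loss = bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return disc


def turn_reward(disc, state, action):
    """Step 2: the frozen discriminator provides a dense per-turn reward that any
    on-policy or off-policy RL algorithm can consume."""
    with torch.no_grad():
        d = disc(state, action).clamp(1e-6, 1 - 1e-6)
    return torch.log(d) - torch.log(1 - d)  # one common reward-shaping choice


if __name__ == "__main__":
    # Toy demo with random tensors standing in for real dialogue data.
    disc = RewardDiscriminator()
    human = [(torch.randn(8, STATE_DIM), torch.randn(8, ACTION_DIM))]
    generated = [(torch.randn(8, STATE_DIM), torch.randn(8, ACTION_DIM))]
    train_discriminator(disc, human, generated, epochs=1)
    print(turn_reward(disc, torch.randn(1, STATE_DIM), torch.randn(1, ACTION_DIM)))
```

Because the discriminator stays fixed during policy optimization, a reward function like `turn_reward` can be queried by an on-policy learner such as PPO or attached to the replay buffer of an off-policy learner, with no adversarial update in the loop.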
Updated: 2020-09-18