Two-level Q-learning: learning from conflict demonstrations
The Knowledge Engineering Review (IF 2.8) Pub Date: 2019-11-12, DOI: 10.1017/s0269888919000092
Mao Li, Yi Wei, Daniel Kudenko

One way to address the low sample efficiency of reinforcement learning (RL) is to employ human expert demonstrations to speed up the RL process (RL from demonstration, or RLfD). Research so far has focused on demonstrations from a single expert. However, little attention has been given to the case where demonstrations are collected from multiple experts, whose expertise may vary across different aspects of the task. In such scenarios, the demonstrations are likely to contain conflicting advice in many parts of the state space. We propose a two-level Q-learning algorithm in which the RL agent not only learns a policy for choosing the optimal action but also learns to select the most trustworthy expert for the current state. Our approach thus removes the traditional assumption that demonstrations come from a single source and are mostly conflict-free. We evaluate the technique on three different domains. The results show that the state-of-the-art RLfD baseline either fails to converge or performs similarly to conventional Q-learning; in contrast, the performance of our algorithm improves as more experts are involved in the learning process, indicating that the proposed approach handles conflicting demonstrations well.
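The abstract describes the two-level structure (a Q-function over actions plus a Q-function over experts) but not the exact update rules. The following is a minimal sketch of one plausible reading of that idea, in tabular form; the class name, the epsilon-greedy scheme, and the choice to credit the consulted expert with the same TD error are all illustrative assumptions, not the paper's algorithm.

```python
import random
from collections import defaultdict

class TwoLevelQAgent:
    """Sketch of a two-level Q-learner: a low-level Q over actions and a
    high-level Q over experts, used to decide whose (possibly conflicting)
    demonstration to trust in each state. Update rules are assumptions."""

    def __init__(self, actions, n_experts, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.actions = actions
        self.n_experts = n_experts
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q_action = defaultdict(float)  # (state, action) -> value
        self.q_expert = defaultdict(float)  # (state, expert) -> trust value

    def select_expert(self, state):
        # High level: pick the expert currently most trusted in this state.
        return max(range(self.n_experts),
                   key=lambda e: self.q_expert[(state, e)])

    def select_action(self, state, demos):
        # Low level: epsilon-greedy over actions, preferring the most
        # trusted expert's demonstrated action when one exists.
        # `demos` maps (expert, state) -> demonstrated action.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        expert = self.select_expert(state)
        demo_action = demos.get((expert, state))
        if demo_action is not None:
            return demo_action
        return max(self.actions, key=lambda a: self.q_action[(state, a)])

    def update(self, state, action, reward, next_state, expert=None):
        # Standard Q-learning update at the action level.
        best_next = max(self.q_action[(next_state, a)] for a in self.actions)
        td = reward + self.gamma * best_next - self.q_action[(state, action)]
        self.q_action[(state, action)] += self.alpha * td
        # Credit the consulted expert with the same TD signal, so experts
        # whose advice leads to high return become more trusted per-state.
        if expert is not None:
            self.q_expert[(state, expert)] += self.alpha * td
```

Under this reading, conflicting demonstrations are resolved per-state: an expert whose advice yields poor returns in one region of the state space loses trust only there, while remaining the preferred source elsewhere.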
