Introspective Q-learning and learning from demonstration
The Knowledge Engineering Review (IF 2.8) · Pub Date: 2019-07-12 · DOI: 10.1017/s0269888919000031
Mao Li, Tim Brys, Daniel Kudenko

One challenge faced by reinforcement learning (RL) agents is that in many environments the reward signal is sparse, leading to slow improvement of the agent's performance in early learning episodes. Potential-based reward shaping can help to resolve this issue of sparse rewards by incorporating an expert's domain knowledge into the learning through a potential function. Past work on reinforcement learning from demonstration (RLfD) directly mapped (sub-optimal) human expert demonstrations to a potential function, which can speed up RL. In this paper we propose an introspective RL agent that further speeds up learning significantly. An introspective RL agent records its state–action decisions and experience during learning in a priority queue. Good-quality decisions, according to a Monte Carlo estimate, are kept in the queue, while poorer decisions are rejected. The queue is then used as a demonstration to speed up RL via reward shaping. A human expert's demonstration can be used to initialize the priority queue before the learning process starts. Experimental validation in the 4-dimensional CartPole domain and the 27-dimensional Super Mario AI domain shows that our approach significantly outperforms non-introspective RL and state-of-the-art RLfD approaches in both domains.
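The core mechanism described above can be illustrated compactly. Below is a minimal Python sketch, assuming a discounted, episodic, tabular setting; names such as QUEUE_SIZE, record_episode, and the indicator-style potential are illustrative assumptions, not the paper's actual implementation. It shows a bounded priority queue that keeps only decisions with high Monte Carlo returns, and the potential-based shaping term F = γΦ(s′, a′) − Φ(s, a) computed from that queue.

```python
import heapq
import itertools

GAMMA = 0.99       # discount factor (assumed)
QUEUE_SIZE = 100   # priority-queue capacity (assumed)

_tie = itertools.count()  # tiebreaker so heapq never compares raw states
demo_queue = []           # min-heap of (return, tiebreak, state, action)

def record_episode(trajectory):
    """Introspection step: score each (state, action, reward) step of a
    finished episode by its Monte Carlo return and keep only the
    best-scored decisions in the bounded priority queue."""
    g = 0.0
    for state, action, reward in reversed(trajectory):
        g = reward + GAMMA * g  # Monte Carlo return from this step onward
        entry = (g, next(_tie), state, action)
        if len(demo_queue) < QUEUE_SIZE:
            heapq.heappush(demo_queue, entry)
        else:
            # Evict the currently worst-scored decision if this one is better.
            heapq.heappushpop(demo_queue, entry)

def potential(state, action):
    """Indicator-style potential over state-action pairs: 1 if the pair
    matches a stored good decision, else 0 (an illustrative choice)."""
    return 1.0 if any(s == state and a == action
                      for _, _, s, a in demo_queue) else 0.0

def shaped_reward(reward, s, a, s_next, a_next):
    """Potential-based shaping over state-action pairs:
    r' = r + gamma * Phi(s', a') - Phi(s, a)."""
    return reward + GAMMA * potential(s_next, a_next) - potential(s, a)
```

Seeding demo_queue with scored steps from a human demonstration before training, as the abstract describes, would initialize the shaping signal before the agent's own experience accumulates.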

Updated: 2019-07-12