Active deep Q-learning with demonstration
Machine Learning (IF 4.3) Pub Date: 2019-11-08, DOI: 10.1007/s10994-019-05849-4
Si-An Chen, Voot Tangkaratt, Hsuan-Tien Lin, Masashi Sugiyama

Reinforcement learning (RL) is a machine learning technique that aims to learn how to take actions in an environment to maximize some kind of reward. Recent research has shown that although the learning efficiency of RL can be improved with expert demonstration, obtaining enough demonstration usually takes considerable effort, which makes training decent RL agents with expert demonstration impractical. In this work, we propose Active Reinforcement Learning with Demonstration, a new framework that reduces the demonstration effort of RL by allowing the agent to actively query for demonstration during training. Under this framework, we propose Active deep Q-Network, a novel query strategy based on a classical RL algorithm called the deep Q-network (DQN). The proposed algorithm dynamically estimates the uncertainty of recent states and utilizes the queried demonstration data by optimizing a supervised loss in addition to the usual DQN loss. We propose two methods of estimating the uncertainty, based on two state-of-the-art DQN models: the divergence of bootstrapped DQN and the variance of noisy DQN. The empirical results validate that, with the same amount of demonstration, both methods not only learn faster than other passive expert-demonstration methods but also reach super-expert levels of performance across four different tasks.
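To make the query-and-train loop described in the abstract concrete, the following is a minimal Python sketch of how a bootstrapped-head uncertainty estimate, an active query rule, and a combined DQN-plus-supervised objective might fit together. The running-quantile threshold, the cross-entropy form of the supervised term, and the weighting coefficient `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def head_disagreement(q_heads):
    """Per-state uncertainty as the disagreement across bootstrapped Q-heads.
    q_heads: array of shape (n_heads, n_actions)."""
    greedy_values = q_heads.max(axis=1)        # each head's best action-value
    return float(greedy_values.var())

def should_query(recent_uncertainties, current_uncertainty, quantile=0.9):
    """Query the expert only when the current state looks more uncertain than
    most recently visited states (a running-quantile threshold, assumed here)."""
    threshold = np.quantile(recent_uncertainties, quantile)
    return current_uncertainty >= threshold

def supervised_demo_loss(q_values, expert_action):
    """Cross-entropy between a softmax over Q-values and the queried expert
    action; a stand-in for the paper's supervised term."""
    logits = q_values - q_values.max()          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[expert_action]

def total_loss(td_loss, demo_loss, lam=1.0):
    """Combined objective: usual DQN temporal-difference loss plus the
    demonstration term, weighted by a hypothetical coefficient lam."""
    return td_loss + lam * demo_loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_heads = rng.normal(size=(10, 4))          # 10 bootstrapped heads, 4 actions
    u = head_disagreement(q_heads)
    history = rng.uniform(0.0, 1.0, size=100)   # uncertainties of recent states
    if should_query(history, u):
        demo = supervised_demo_loss(q_heads.mean(axis=0), expert_action=2)
        print("query expert, demo loss:", demo)
```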
