Goal-driven active learning
Autonomous Agents and Multi-Agent Systems (IF 2.0) Pub Date: 2021-08-16, DOI: 10.1007/s10458-021-09527-5
Nicolas Bougie, Ryutaro Ichise

Deep reinforcement learning methods have achieved significant successes in complex decision-making problems. However, they traditionally rely on well-designed extrinsic rewards, which limits their applicability to the many real-world tasks where rewards are naturally sparse. While cloning behaviors provided by an expert is a promising approach to the exploration problem, learning from a fixed set of demonstrations may be impracticable due to lack of state coverage or distribution mismatch, that is, when the learner's goal deviates from the demonstrated behaviors. Moreover, we are interested in learning how to reach a wide range of goals from the same set of demonstrations. In this work, we propose a novel goal-conditioned method that leverages very small sets of goal-driven demonstrations to massively accelerate the learning process. Crucially, we introduce the concept of active goal-driven demonstrations to query the demonstrator only in hard-to-learn and uncertain regions of the state space. We further present a strategy for prioritizing the sampling of goals where the disagreement between the expert and the policy is maximized. We evaluate our method on a variety of benchmark environments from the MuJoCo domain. Experimental results show that our method outperforms prior imitation learning approaches on most tasks in terms of exploration efficiency and average scores.
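The goal-prioritization idea described in the abstract (sample goals where expert-policy disagreement is largest) can be sketched in a few lines. This is a minimal sketch, assuming a continuous action space and Euclidean distance between proposed actions as the disagreement measure; the expert, policy, goal set, and temperature parameter below are hypothetical placeholders for illustration, not the authors' implementation.

import numpy as np

rng = np.random.default_rng(0)

def disagreement(policy_action, expert_action):
    # Expert-policy disagreement: Euclidean distance between the
    # actions the two propose for the same (state, goal) pair.
    return np.linalg.norm(policy_action - expert_action)

def prioritize_goal(goals, policy, expert, state, temperature=1.0):
    # Sample one goal with probability increasing in disagreement,
    # so learning focuses where the policy deviates most from the expert.
    scores = np.array([disagreement(policy(state, g), expert(state, g))
                       for g in goals])
    probs = np.exp(scores / temperature)
    probs /= probs.sum()
    return goals[rng.choice(len(goals), p=probs)]

# Hypothetical stand-ins: a deterministic "expert" and a noisy "policy".
expert = lambda s, g: g - s
policy = lambda s, g: (g - s) + rng.normal(scale=0.5, size=s.shape)

state = np.zeros(2)
goals = rng.uniform(-1.0, 1.0, size=(16, 2))
print(prioritize_goal(goals, policy, expert, state))

A softmax over disagreement scores keeps every goal reachable while biasing sampling toward the hardest ones; the temperature (an assumption here) controls how sharply the sampler concentrates on high-disagreement goals.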

Updated: 2021-08-19