Learning Abstract Models for Strategic Exploration and Fast Reward Transfer
arXiv - CS - Artificial Intelligence | Pub Date: 2020-07-12 | arXiv:2007.05896
Evan Zheran Liu, Ramtin Keramati, Sudarshan Seshadri, Kelvin Guu, Panupong Pasupat, Emma Brunskill, Percy Liang

Model-based reinforcement learning (RL) is appealing because (i) it enables planning and thus more strategic exploration, and (ii) by decoupling dynamics from rewards, it enables fast transfer to new reward functions. However, learning an accurate Markov Decision Process (MDP) over high-dimensional states (e.g., raw pixels) is extremely challenging because it requires function approximation, which leads to compounding errors. Instead, to avoid compounding errors, we propose learning an abstract MDP over abstract states: low-dimensional coarse representations of the state (e.g., capturing agent position, ignoring other objects). We assume access to an abstraction function that maps the concrete states to abstract states. In our approach, we construct an abstract MDP, which grows through strategic exploration via planning. Similar to hierarchical RL approaches, the abstract actions of the abstract MDP are backed by learned subpolicies that navigate between abstract states. Our approach achieves strong results on three of the hardest Arcade Learning Environment games (Montezuma's Revenge, Pitfall!, and Private Eye), including superhuman performance on Pitfall! without demonstrations. After training on one task, we can reuse the learned abstract MDP for new reward functions, achieving higher reward in 1000x fewer samples than model-free methods trained from scratch.
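To make the idea above concrete, the following is a minimal, illustrative Python sketch of a tabular abstract MDP grown from experience and planned over with a count-based optimism bonus. All names here (abstraction_fn, AbstractMDP, plan) and the specific bonus form are our own assumptions for illustration, not the authors' implementation; in the paper, abstract actions are additionally backed by learned subpolicies that navigate between abstract states, which are left as opaque action labels here.

```python
# Illustrative sketch only: a tabular abstract MDP with optimistic planning.
# Assumed/hypothetical: abstraction_fn, the (room, x, y) state fields,
# and the 1/sqrt(n) exploration bonus. Not the authors' code.
from collections import defaultdict


def abstraction_fn(state):
    """Map a concrete state to a coarse abstract state.

    The paper assumes such a function is given; here we assume the state
    exposes the agent's room and (x, y) position and discretize it.
    """
    return (state["room"], state["x"] // 20, state["y"] // 20)


class AbstractMDP:
    """Tabular model over abstract states, grown from observed transitions."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.rewards = defaultdict(float)                    # (s, a) -> running mean reward
        self.states = set()

    def update(self, s, a, r, s_next):
        n = sum(self.counts[(s, a)].values())
        self.counts[(s, a)][s_next] += 1
        self.rewards[(s, a)] += (r - self.rewards[(s, a)]) / (n + 1)  # incremental mean
        self.states.update([s, s_next])

    def transition_probs(self, s, a):
        visits = self.counts[(s, a)]
        total = sum(visits.values())
        return {s2: c / total for s2, c in visits.items()} if total else {}


def plan(model, gamma=0.99, bonus=1.0, iters=200):
    """Value iteration with an optimism bonus on rarely tried abstract actions,
    so planning drives the agent toward the frontier of the abstract MDP."""
    V = defaultdict(float)
    actions = {s: [a for (s2, a) in model.counts if s2 == s] for s in model.states}
    for _ in range(iters):
        for s in model.states:
            best = 0.0
            for a in actions.get(s, []):
                n = sum(model.counts[(s, a)].values())
                q = (model.rewards[(s, a)] + bonus / (n ** 0.5)
                     + gamma * sum(p * V[s2]
                                   for s2, p in model.transition_probs(s, a).items()))
                best = max(best, q)
            V[s] = best
    return V
```

Because the transition counts are kept separate from the reward estimates, reward transfer in this sketch amounts to swapping the rewards table for the new task's rewards and re-running plan(), with no further environment interaction needed to relearn the dynamics.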

Updated: 2020-07-14