Planning-Augmented Hierarchical Reinforcement Learning
IEEE Robotics and Automation Letters (IF 5.2), Pub Date: 2021-04-05, DOI: 10.1109/lra.2021.3071062
Robert Gieselmann, Florian T. Pokorny

Planning algorithms are powerful at solving long-horizon decision-making problems but require that the environment dynamics are known. Model-free reinforcement learning has recently been combined with graph-based planning to increase the robustness of trained policies in state-space navigation problems. Recent work suggests using planning to provide intermediate waypoints that guide the policy in long-horizon tasks. Yet, it is not always practical to describe a problem as state-to-state navigation. Often, the goal is defined by one or multiple disjoint sets of valid states, or implicitly through an abstract task description. Building upon previous efforts, we introduce a novel algorithm called Planning-Augmented Hierarchical Reinforcement Learning (PAHRL), which translates the concept of hybrid planning/RL to such problems with implicitly defined goals. Using a hierarchical framework, we divide the original task, formulated as a Markov Decision Process (MDP), into a hierarchy of shorter-horizon MDPs. Actor-critic agents are trained in parallel, one for each level of the hierarchy. During testing, a planner then determines useful subgoals on a state graph constructed at the bottom level of the hierarchy. The effectiveness of our approach is demonstrated for a set of continuous control problems in simulation, including robot-arm reaching tasks and the manipulation of a deformable object.
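The test-time procedure described above lends itself to a short illustration. The Python sketch below shows the general pattern of graph-based planning over visited states combined with a goal-conditioned policy. It is a minimal sketch under stated assumptions, not the authors' implementation: the policy pi_low, the Euclidean edge and termination thresholds, the graph construction from stored states, and the classic Gym-style env.step interface are all illustrative choices that do not come from the paper.

# Minimal illustrative sketch of hybrid graph planning + goal-conditioned RL.
# NOT the authors' PAHRL code; pi_low, thresholds, and the Euclidean edge
# rule are assumptions made for this example.
import numpy as np
import networkx as nx

def build_state_graph(states, max_edge_dist=1.0):
    """Connect previously visited states whose pairwise distance is small."""
    graph = nx.Graph()
    graph.add_nodes_from(range(len(states)))
    for i in range(len(states)):
        for j in range(i + 1, len(states)):
            dist = float(np.linalg.norm(states[i] - states[j]))
            if dist < max_edge_dist:
                graph.add_edge(i, j, weight=dist)
    return graph

def nearest_node(states, query):
    """Index of the stored state closest to `query`."""
    return int(np.argmin([np.linalg.norm(s - query) for s in states]))

def plan_subgoals(graph, states, start_state, goal_nodes):
    """Shortest path from the current state to the nearest valid goal node."""
    start = nearest_node(states, start_state)
    # Multi-source Dijkstra searches from all goal nodes at once; the
    # returned path runs goal -> start, so reverse it before following it.
    _, path = nx.multi_source_dijkstra(graph, set(goal_nodes), target=start)
    return [states[i] for i in reversed(path)]

def run_episode(env, pi_low, states, goal_nodes, subgoal_horizon=50, eps=0.1):
    """Follow planned subgoals with the goal-conditioned low-level policy."""
    graph = build_state_graph(states)
    obs = env.reset()
    for subgoal in plan_subgoals(graph, states, obs, goal_nodes):
        for _ in range(subgoal_horizon):
            obs, _, done, _ = env.step(pi_low(obs, subgoal))  # Gym-style API assumed
            if done or np.linalg.norm(obs - subgoal) < eps:
                break
    return obs

The multi-source search starts from all goal nodes simultaneously, which mirrors the setting described in the abstract where the goal is given by one or multiple disjoint sets of valid states rather than a single target state.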

Updated: 2021-04-23