Modular Deep Reinforcement Learning for Continuous Motion Planning with Temporal Logic,arXiv - CS - Formal Languages and Automata Theory


arXiv - CS - Formal Languages and Automata Theory, Pub Date: 2021-02-24, DOI: arxiv-2102.12855
Mingyu Cai, Mohammadhosein Hasanbeig, Shaoping Xiao, Alessandro Abate, Zhen Kan

This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDPs) with unknown transition probabilities over continuous state and action spaces. Linear temporal logic (LTL) is used to specify high-level tasks over an infinite horizon, and each LTL formula can be converted into a limit-deterministic generalized Büchi automaton (LDGBA) with several accepting sets. The novelty is the design of an embedded product MDP (EP-MDP) between the LDGBA and the MDP, which incorporates a synchronous tracking-frontier function that records the unvisited accepting sets of the automaton and thereby facilitates the satisfaction of the accepting conditions. The proposed LDGBA-based reward shaping and discounting schemes for model-free reinforcement learning (RL) depend only on the EP-MDP states and can overcome the issue of sparse rewards. Rigorous analysis shows that any RL method that optimizes the expected discounted return is guaranteed to find an optimal policy whose traces maximize the satisfaction probability. A modular deep deterministic policy gradient (DDPG) is then developed to generate such policies over continuous state and action spaces. The performance of our framework is evaluated on an array of OpenAI Gym environments.
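
The abstract describes the tracking-frontier function only in words, so the following Python sketch is an illustration of the general idea rather than the authors' definition. It assumes that the frontier is the set of accepting sets not yet visited in the current round, that entering an automaton state removes every accepting set containing that state, and that the frontier resets to all accepting sets once it is empty. The function name update_frontier, the state labels, and the reset rule are all hypothetical.

```python
from typing import FrozenSet, Hashable, Set

AcceptingSet = FrozenSet[Hashable]


def update_frontier(
    q: Hashable,
    frontier: Set[AcceptingSet],
    accepting_sets: Set[AcceptingSet],
) -> Set[AcceptingSet]:
    """Return the frontier after the automaton enters state q.

    Hypothetical update rule, assumed for illustration only.
    """
    # Drop every accepting set that contains the current automaton state.
    remaining = {F for F in frontier if q not in F}
    # Assumption: once every accepting set has been visited, start a new round.
    if not remaining:
        remaining = set(accepting_sets)
    return remaining


# Toy automaton with two accepting sets (state labels are made up).
F1 = frozenset({"q1"})
F2 = frozenset({"q2"})
all_sets = {F1, F2}

frontier = set(all_sets)
for q in ["q0", "q1", "q2"]:
    frontier = update_frontier(q, frontier, all_sets)
    print(q, sorted(sorted(F) for F in frontier))
```

In this toy run the frontier is unchanged at q0, loses F1 at q1, and resets to both sets after q2 visits the last remaining one. Pairing such a frontier with the MDP state and the automaton state is one way to read the "embedded" product the abstract mentions.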


Updated: 2021-02-26
