Reinforcement Learning Based Temporal Logic Control with Soft Constraints Using Limit-deterministic Büchi Automata
arXiv - CS - Formal Languages and Automata Theory Pub Date : 2021-01-25 , DOI: arxiv-2101.10284 Mingyu Cai, Shaoping Xiao, Zhen Kan
This paper studies the control synthesis of motion planning subject to
uncertainties. The uncertainties are considered in robot motion and environment
properties, giving rise to the probabilistic labeled Markov decision process
(MDP). A model-free reinforcement learning (RL) algorithm is developed to
generate a finite-memory control policy that satisfies high-level tasks
expressed as linear temporal logic (LTL) formulas. One novelty is to translate
LTL into a limit-deterministic generalized Büchi automaton (LDGBA) and develop a
corresponding embedded LDGBA (E-LDGBA) by incorporating a tracking-frontier
function to overcome the issue of sparse accepting rewards, resulting in
improved learning performance without increasing computational complexity. Due
to potentially conflicting tasks, a relaxed product MDP is developed to allow
the agent to revise its motion plan without strictly following the desired LTL
constraints if the desired tasks can only be partially fulfilled. An expected
return composed of violation rewards and accepting rewards is developed. The
designed violation function quantifies the difference between the revised and
the desired motion plans, while the accepting rewards are designed to
enforce the satisfaction of the acceptance condition of the relaxed product
MDP. Rigorous analysis shows that any RL algorithm that optimizes the expected
return is guaranteed to find policies that, in decreasing order of priority,
1) satisfy the acceptance condition of the relaxed product MDP and 2) reduce
the violation cost over long-term behaviors. Finally, we validate the control
synthesis approach via
simulation and experimental results.
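The tracking-frontier mechanism and the composite reward described above can be sketched as follows. This is a minimal illustration under assumed encodings: automaton states are integers, each accepting set of the generalized Büchi condition is a frozenset of states, and the function names, reward magnitudes, and violation weight are illustrative, not the paper's implementation.

```python
from typing import FrozenSet, List

State = int
AcceptingSet = FrozenSet[State]
Frontier = FrozenSet[AcceptingSet]


def update_frontier(frontier: Frontier, state: State,
                    accepting_sets: List[AcceptingSet]) -> Frontier:
    """Drop every accepting set that the current automaton state visits;
    once all sets have been visited (one round of the generalized Büchi
    acceptance condition), reset the frontier, excluding sets hit now."""
    remaining = frozenset(F for F in frontier if state not in F)
    if not remaining:  # round complete -> reset the tracking frontier
        remaining = frozenset(F for F in accepting_sets if state not in F)
    return remaining


def shaped_reward(progressed: bool, violation: float,
                  r_accept: float = 1.0, beta: float = 0.5) -> float:
    """Accepting reward for making frontier progress, minus a weighted
    violation cost for deviating from the desired LTL task. The constants
    r_accept and beta are assumed tuning parameters."""
    return (r_accept if progressed else 0.0) - beta * violation
```

For example, with accepting sets {1} and {2}, visiting state 1 removes the first set from the frontier, and visiting state 2 afterwards completes the round and resets the frontier to the sets not hit by the current state. This keeps accepting rewards available on every round of the acceptance condition, which is how the tracking frontier mitigates reward sparsity.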
Updated: 2021-01-26