Reinforcement Learning of Control Policy for Linear Temporal Logic Specifications Using Limit-Deterministic Generalized Büchi Automata
arXiv - CS - Systems and Control. Pub Date: 2020-01-14, DOI: arxiv-2001.04669
Ryohei Oura, Ami Sakakibara, Toshimitsu Ushio

This letter proposes a novel reinforcement learning method for the synthesis of a control policy satisfying a control specification described by a linear temporal logic formula. We assume that the controlled system is modeled by a Markov decision process (MDP). We convert the specification into a limit-deterministic generalized Büchi automaton (LDGBA) with several accepting sets, which accepts all infinite sequences satisfying the formula. The LDGBA is augmented so that it explicitly records previous visits to the accepting sets. We take the product of the augmented LDGBA and the MDP, on the basis of which we define a reward function. The agent is rewarded whenever a state transition lies in an accepting set that has not been visited for a certain number of steps. Consequently, the sparsity of rewards is relaxed and optimal circulations among the accepting sets are learned. We show that the proposed method can learn an optimal policy when the discount factor is sufficiently close to one.

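The visitation-tracking reward can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical rendering of the idea only, not the authors' implementation: for simplicity it treats the accepting sets as sets of automaton states (a transition-based acceptance condition would work analogously), and all names, signatures, and the reward value of 1.0 are assumptions.

```python
# A minimal sketch of the reward scheme described in the abstract; not the
# authors' implementation. Accepting sets are modeled as sets of automaton
# states, and all identifiers here are hypothetical.
from typing import FrozenSet, Tuple

AcceptingSets = Tuple[FrozenSet[str], ...]  # one frozenset of states per accepting set

def step_reward(
    accepting_sets: AcceptingSets,
    visited: FrozenSet[int],   # indices of accepting sets visited since the last reset
    next_state: str,           # automaton component of the next product state
) -> Tuple[float, FrozenSet[int]]:
    """Reward the agent for entering an accepting set it has not visited
    since the last reset; once every set has been hit, clear the record so
    the agent keeps circulating among all accepting sets."""
    for i, acc in enumerate(accepting_sets):
        if next_state in acc and i not in visited:
            visited = visited | {i}
            if len(visited) == len(accepting_sets):
                visited = frozenset()  # all sets seen: start a new round
            return 1.0, visited
    return 0.0, visited

# Example: two accepting sets; visiting q1 and then q2 earns a reward each
# time, after which the visitation record is reset.
sets = (frozenset({"q1"}), frozenset({"q2"}))
r1, mem = step_reward(sets, frozenset(), "q1")   # r1 == 1.0
r2, mem = step_reward(sets, mem, "q2")           # r2 == 1.0, mem reset
```

Because rewards recur on every round of visits rather than only at a single goal, this construction relaxes reward sparsity while still enforcing the generalized Büchi condition that every accepting set be visited infinitely often.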
Updated: 2020-03-27