Entropy Maximization for Markov Decision Processes Under Temporal Logic Constraints, IEEE Transactions on Automatic Control


IEEE Transactions on Automatic Control (IF 6.8), Pub Date: 2020-04-01, DOI: 10.1109/tac.2019.2922583
Yagiz Savas, Melkior Ornik, Murat Cubuktepe, Mustafa O. Karabag, Ufuk Topcu

We study the problem of synthesizing a policy that maximizes the entropy of a Markov decision process (MDP) subject to a temporal logic constraint. Such a policy minimizes the predictability of the paths it generates, or dually, maximizes the exploration of different paths in an MDP while ensuring the satisfaction of a temporal logic specification. We first show that the maximum entropy of an MDP can be finite, infinite, or unbounded. We provide necessary and sufficient conditions under which the maximum entropy of an MDP is finite, infinite, or unbounded. We then present an algorithm which is based on a convex optimization problem to synthesize a policy that maximizes the entropy of an MDP. We also show that maximizing the entropy of an MDP is equivalent to maximizing the entropy of the paths that reach a certain set of states in the MDP. Finally, we extend the algorithm to an MDP subject to a temporal logic specification. In numerical examples, we demonstrate the proposed method on different motion planning scenarios and illustrate the relation between the restrictions imposed on the paths by a specification, the maximum entropy, and the predictability of paths.
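As a rough illustration of the quantity being maximized, the sketch below computes the entropy of the paths generated by a fixed policy on a toy MDP. It is not the paper's convex-optimization algorithm. It assumes the standard expression for the entropy of an absorbing Markov chain (the sum over transient states of expected residence time multiplied by the local entropy of the outgoing transition distribution), and it assumes every non-absorbing state is transient, which corresponds to the finite-entropy case. The three-state MDP and the names `induced_chain`, `local_entropy`, and `path_entropy` are hypothetical and chosen only for this example.

```python
import math

def induced_chain(mdp, policy):
    """Markov chain obtained by fixing a randomized stationary policy.

    mdp[s][a][t] is the probability of moving from s to t under action a;
    policy[s][a] is the probability of choosing action a in state s.
    """
    chain = {}
    for s, actions in mdp.items():
        row = {}
        for a, successors in actions.items():
            for t, p in successors.items():
                row[t] = row.get(t, 0.0) + policy[s][a] * p
        chain[s] = row
    return chain

def local_entropy(row):
    """Shannon entropy (in bits) of one state's outgoing distribution."""
    return -sum(p * math.log2(p) for p in row.values() if p > 0.0)

def path_entropy(chain, initial, iters=10_000, tol=1e-12):
    """Entropy of the chain's paths, assuming all non-absorbing states are transient.

    Expected residence times are accumulated by pushing probability mass
    forward one step at a time until almost all of it has been absorbed.
    """
    absorbing = {s for s, row in chain.items() if row.get(s, 0.0) == 1.0}
    residence = {s: 0.0 for s in chain}
    mass = {initial: 1.0}
    for _ in range(iters):
        nxt = {}
        for s, m in mass.items():
            if s in absorbing:
                continue  # absorbed mass contributes no further entropy
            residence[s] += m
            for t, p in chain[s].items():
                nxt[t] = nxt.get(t, 0.0) + m * p
        mass = nxt
        if sum(m for s, m in mass.items() if s not in absorbing) < tol:
            break
    return sum(residence[s] * local_entropy(chain[s])
               for s in chain if s not in absorbing)

# Hypothetical toy MDP: from s0, action "a" goes to s1 and action "b" goes
# straight to the absorbing state s2; s1 always moves on to s2.
mdp = {
    "s0": {"a": {"s1": 1.0}, "b": {"s2": 1.0}},
    "s1": {"a": {"s2": 1.0}},
    "s2": {"a": {"s2": 1.0}},
}
uniform = {"s0": {"a": 0.5, "b": 0.5}, "s1": {"a": 1.0}, "s2": {"a": 1.0}}
deterministic = {"s0": {"a": 1.0, "b": 0.0}, "s1": {"a": 1.0}, "s2": {"a": 1.0}}

h_uniform = path_entropy(induced_chain(mdp, uniform), "s0")
h_det = path_entropy(induced_chain(mdp, deterministic), "s0")
```

In this toy model the uniform policy splits probability evenly between two distinct paths, so its path entropy is one bit, whereas the deterministic policy always produces the same path and has zero entropy. If a randomized choice sat inside a recurrent loop, the residence times would not converge and this sketch would no longer apply; that situation corresponds to the infinite or unbounded cases mentioned in the abstract.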


Updated: 2020-04-01
