A Convex Programming Approach for Discrete-Time Markov Decision Processes under the Expected Total Reward Criterion
SIAM Journal on Control and Optimization (IF 2.2). Pub Date: 2020-08-25. DOI: 10.1137/19m1255811
François Dufour, Alexandre Genadot

SIAM Journal on Control and Optimization, Volume 58, Issue 4, pp. 2535–2566, January 2020.
In this work, we study discrete-time Markov decision processes (MDPs) under constraints, with Borel state and action spaces, where all performance functions take the same form: an expected total reward (ETR) criterion over the infinite time horizon. One of our objectives is to propose a convex programming formulation for this type of MDP. It will be shown that the values of the constrained control problem and of the associated convex program coincide. Moreover, if there exists an optimal solution to the convex program, then there exists a stationary randomized policy which is optimal for the MDP. It will also be shown that, in the framework of constrained control problems, the supremum of the ETRs over the set of randomized policies is equal to the supremum of the ETRs over the set of stationary randomized policies. We consider standard hypotheses such as the so-called continuity-compactness conditions and a Slater-type condition. Our assumptions are weak enough to cover cases that have not yet been addressed in the literature. Examples are presented to illustrate our results.
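To make the approach concrete, here is a minimal sketch, in generic notation, of the occupation-measure formulation that convex programs of this kind typically take; the symbols below (state space $X$, action space $A$, kernel $Q$, rewards $r_j$, constraint levels $\theta_j$, initial distribution $\nu$) are illustrative assumptions and need not match the paper's own notation. The constrained control problem reads

\[
\sup_{\pi}\; \mathbb{E}_{\nu}^{\pi}\Bigl[\sum_{t=0}^{\infty} r_0(X_t, A_t)\Bigr]
\quad \text{subject to} \quad
\mathbb{E}_{\nu}^{\pi}\Bigl[\sum_{t=0}^{\infty} r_j(X_t, A_t)\Bigr] \;\ge\; \theta_j, \qquad j = 1, \dots, q,
\]

where the supremum is taken over randomized policies $\pi$. The associated convex program is posed over nonnegative (occupation) measures $\mu$ on $X \times A$:

\[
\sup_{\mu \ge 0}\; \int_{X \times A} r_0 \, d\mu
\quad \text{subject to} \quad
\mu(B \times A) \;=\; \nu(B) + \int_{X \times A} Q(B \mid x, a)\, \mu(dx, da) \quad \text{for all Borel } B \subseteq X,
\]
\[
\int_{X \times A} r_j \, d\mu \;\ge\; \theta_j, \qquad j = 1, \dots, q.
\]

Under this reading, the standard mechanism behind results of the type stated in the abstract is disintegration of an optimal measure, $\mu^*(dx, da) = \hat{\mu}^*(dx)\, \varphi^*(da \mid x)$, with the stochastic kernel $\varphi^*$ serving as the stationary randomized policy.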


Last updated: 2020-08-26