A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with Constraints
arXiv - CS - Systems and Control. Pub Date: 2020-09-23, DOI: arXiv:2009.11348
Krishna C. Kalagarla, Rahul Jain, Pierluigi Nuzzo

Constrained Markov Decision Processes (CMDPs) formalize sequential decision-making problems whose objective is to minimize a cost function while satisfying constraints on other cost functions. In this paper, we consider the setting of episodic fixed-horizon CMDPs. We propose an online algorithm which leverages the linear programming formulation of the finite-horizon CMDP for repeated optimistic planning to provide a probably approximately correct (PAC) guarantee on the number of episodes needed to ensure an $\epsilon$-optimal policy, i.e., a policy whose objective value is within $\epsilon$ of the optimal value and which satisfies the constraints within $\epsilon$-tolerance, with probability at least $1-\delta$. The number of episodes needed is shown to be of the order $\tilde{\mathcal{O}}\big(\frac{|S||A|C^{2}H^{2}}{\epsilon^{2}}\log\frac{1}{\delta}\big)$, where $C$ is the upper bound on the number of possible successor states for a state-action pair. Therefore, if $C \ll |S|$, the number of episodes needed has a linear dependence on the state and action space sizes $|S|$ and $|A|$, respectively, and a quadratic dependence on the time horizon $H$.
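The linear programming formulation the abstract refers to is the standard one over occupancy measures: one variable $q(h,s,a)$ per step-state-action triple, flow-conservation equalities encoding the dynamics, and a linear inequality for each constraint cost. The following is a minimal sketch of that LP for a toy CMDP with known transitions, using `scipy.optimize.linprog`; the specific costs, transition matrix, and budget are made-up illustrative values, not from the paper (the paper's algorithm solves such LPs repeatedly with optimistic model estimates, which is not shown here).

```python
import numpy as np
from scipy.optimize import linprog

# Toy CMDP (hypothetical numbers): |S| = 2, |A| = 2, horizon H = 3.
S, A, H = 2, 2, 3
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.4, 0.6]]])   # P[s, a, s'] = transition prob.
c = np.array([[0.1, 1.0], [0.6, 0.3]])     # objective cost c(s, a)
d = np.array([[0.8, 0.2], [0.5, 0.1]])     # constraint cost d(s, a)
mu = np.array([1.0, 0.0])                  # initial state distribution
budget = 1.0                               # require E[sum_h d] <= budget

# Decision variables: occupancy measures q[h, s, a], flattened to a vector.
idx = lambda h, s, a: (h * S + s) * A + a
n = H * S * A

# Objective: minimize sum_{h,s,a} c(s, a) * q[h, s, a].
obj = np.zeros(n)
for h in range(H):
    for s in range(S):
        for a in range(A):
            obj[idx(h, s, a)] = c[s, a]

# Equalities: initial-distribution and flow-conservation constraints.
A_eq, b_eq = [], []
for s in range(S):                         # sum_a q[0, s, a] = mu(s)
    row = np.zeros(n)
    for a in range(A):
        row[idx(0, s, a)] = 1.0
    A_eq.append(row); b_eq.append(mu[s])
for h in range(1, H):                      # inflow = outflow at each step h
    for s2 in range(S):
        row = np.zeros(n)
        for a in range(A):
            row[idx(h, s2, a)] = 1.0
        for s in range(S):
            for a in range(A):
                row[idx(h - 1, s, a)] -= P[s, a, s2]
        A_eq.append(row); b_eq.append(0.0)

# Inequality: expected cumulative constraint cost stays within the budget.
A_ub = np.zeros((1, n))
for h in range(H):
    for s in range(S):
        for a in range(A):
            A_ub[0, idx(h, s, a)] = d[s, a]

res = linprog(obj, A_ub=A_ub, b_ub=[budget],
              A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=(0, None), method="highs")
q = res.x.reshape(H, S, A)
# A (possibly randomized) policy is recovered by normalizing q over actions;
# states with zero occupancy at a step get an arbitrary action there.
pi = q / np.maximum(q.sum(axis=2, keepdims=True), 1e-12)
print("optimal constrained objective:", res.fun)
```

The optimal policy of a CMDP may be randomized, which is why the LP over occupancy measures (rather than value iteration over deterministic policies) is the natural solution object here.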

Updated: 2020-09-25