Learning Parametric Policies and Transition Probability Models of Markov Decision Processes from Data
European Journal of Control (IF 2.5), Pub Date: 2020-05-26, DOI: 10.1016/j.ejcon.2020.04.003
Tingting Xu, Henghui Zhu, Ioannis Ch. Paschalidis

We consider the problem of estimating the policy and transition probability model of a Markov Decision Process from data consisting of (state, action, next-state) tuples. The transition probabilities and the policy are assumed to be parametric functions of a sparse set of features associated with the tuples. We propose two regularized maximum likelihood estimation algorithms for learning the transition probability model and the policy, respectively. We establish an upper bound on the regret, defined as the difference between the average reward of the estimated policy under the estimated transition probabilities and that of the original unknown policy under the true (unknown) transition probabilities. We also provide a sample complexity result showing that low regret is achievable with a relatively small number of training samples. We illustrate the theoretical results with a healthcare example and a robot navigation experiment.
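Both estimators described above are regularized maximum likelihood problems. As a concrete illustration, the following is a minimal sketch (not the authors' implementation) of the policy estimator, assuming a log-linear (softmax) parametric form pi_theta(a|s) proportional to exp(theta . phi(s, a)) with an L1 penalty to encourage sparsity. The feature map phi, the penalty weight lam, the step size, and the proximal-gradient (ISTA) optimizer are all illustrative choices not specified by the paper; the transition model estimator would take the analogous form over (state, action, next-state) features.

```python
# Illustrative sketch of L1-regularized MLE for a softmax policy.
# All names (phi, lam, lr, fit_policy) are hypothetical, for exposition only.
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1; drives small coordinates to zero."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fit_policy(phi, states, actions, n_actions, lam=0.1, lr=0.1, iters=500):
    """Minimize -(1/n) sum_i log pi_theta(a_i | s_i) + lam * ||theta||_1
    by proximal gradient descent, where pi_theta(a|s) is a softmax over
    the linear scores theta . phi(s, a)."""
    d = phi(states[0], actions[0]).shape[0]
    theta = np.zeros(d)
    n = len(states)
    for _ in range(iters):
        grad = np.zeros(d)
        for s, a in zip(states, actions):
            feats = np.stack([phi(s, b) for b in range(n_actions)])
            logits = feats @ theta
            p = np.exp(logits - logits.max())
            p /= p.sum()
            # Gradient of the negative log-likelihood for one sample:
            # E_{b ~ pi_theta(.|s)}[phi(s, b)] - phi(s, a)
            grad += feats.T @ p - phi(s, a)
        theta = soft_threshold(theta - lr * grad / n, lr * lam)
    return theta

# Toy usage: 3 states, 2 actions, one-hot (state, action) features,
# and a deterministic policy to recover from samples.
if __name__ == "__main__":
    def phi(s, a, n_s=3, n_a=2):
        v = np.zeros(n_s * n_a)
        v[s * n_a + a] = 1.0
        return v

    rng = np.random.default_rng(0)
    states = rng.integers(0, 3, size=200)
    actions = states % 2
    theta = fit_policy(phi, states, actions, n_actions=2)
    print(np.round(theta, 2))
```

The soft-thresholding step is the proximal operator of the L1 norm, so coordinates of theta that carry little likelihood signal are set exactly to zero, matching the sparse-feature assumption in the abstract.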



