Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations
arXiv - CS - Systems and Control. Pub Date: 2021-06-22, DOI: arxiv-2106.11519
Christoph Dann, Yishay Mansour, Mehryar Mohri, Ayush Sekhari, Karthik Sridharan

There have been many recent advances in provably efficient Reinforcement Learning (RL) for problems with rich observation spaces. However, all these works share a strong realizability assumption about the optimal value function of the true MDP. Such realizability assumptions are often too strong to hold in practice. In this work, we consider the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies $\Pi$ that may not contain any near-optimal policy. We provide an algorithm for this setting whose error is bounded in terms of the rank $d$ of the underlying MDP. Specifically, our algorithm enjoys a sample complexity bound of $\widetilde{O}\left((H^{4d} K^{3d} \log |\Pi|)/\epsilon^2\right)$, where $H$ is the length of episodes, $K$ is the number of actions, and $\epsilon>0$ is the desired sub-optimality. We also provide a nearly matching lower bound for this agnostic setting, showing that the exponential dependence on the rank is unavoidable without further assumptions.
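To get a feel for the stated sample complexity bound, the following minimal sketch (not from the paper) simply evaluates $(H^{4d} K^{3d} \log |\Pi|)/\epsilon^2$ for a few values of the rank $d$, ignoring constants and logarithmic factors; the parameter values ($H$, $K$, $|\Pi|$, $\epsilon$) are hypothetical choices used only to illustrate how sharply the bound grows with $d$.

```python
# Illustrative only: evaluates the leading term of the bound
# O~((H^(4d) * K^(3d) * log|Pi|) / eps^2), with hypothetical parameter values.
import math

def sample_complexity(H: int, K: int, d: int, num_policies: int, eps: float) -> float:
    """Leading term of the bound: H^(4d) * K^(3d) * log|Pi| / eps^2."""
    return (H ** (4 * d)) * (K ** (3 * d)) * math.log(num_policies) / eps ** 2

# Hypothetical setting: episode length H=5, K=2 actions, |Pi|=1000 policies, eps=0.1.
for d in (1, 2, 3):
    n = sample_complexity(H=5, K=2, d=d, num_policies=1000, eps=0.1)
    print(f"rank d={d}: ~{n:.3e} episodes")
```

The exponential dependence on $d$ visible in this calculation is exactly what the paper's lower bound says cannot be avoided in the agnostic setting without further assumptions.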

Updated: 2021-06-25