The Importance of Pessimism in Fixed-Dataset Policy Optimization

arXiv - CS - Artificial Intelligence. Pub Date: 2020-09-15, DOI: arxiv-2009.06799
Jacob Buckman, Carles Gelada, Marc G. Bellemare

We study worst-case guarantees on the expected return of fixed-dataset policy optimization algorithms. Our core contribution is a unified conceptual and mathematical framework for the study of algorithms in this regime. This analysis reveals that for naive approaches, the possibility of erroneous value overestimation leads to a difficult-to-satisfy requirement: in order to guarantee that we select a policy which is near-optimal, we may need the dataset to be informative of the value of every policy. To avoid this, algorithms can follow the pessimism principle, which states that we should choose the policy which acts optimally in the worst possible world. We show why pessimistic algorithms can achieve good performance even when the dataset is not informative of every policy, and derive families of algorithms which follow this principle. These theoretical findings are validated by experiments on a tabular gridworld, and deep learning experiments on four MinAtar environments.
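To make the pessimism principle concrete, here is a minimal sketch in the simplest possible setting: a single-state problem (a bandit) with a fixed dataset of (action, reward) pairs and rewards assumed to lie in [0, 1]. It is not one of the algorithm families derived in the paper. The function names, the count-based penalty `alpha / sqrt(n)` (a common Hoeffding-style choice), and the constant `alpha` are all illustrative assumptions. The naive rule picks the highest empirical mean, so a single lucky sample can win. The pessimistic rule picks the highest lower-bound estimate, which penalizes actions the dataset says little about.

```python
import math
from collections import defaultdict


def summarize(dataset):
    """Return per-action empirical mean reward and sample count."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for action, reward in dataset:
        totals[action] += reward
        counts[action] += 1
    means = {a: totals[a] / counts[a] for a in counts}
    return means, counts


def naive_choice(dataset):
    """Hypothetical naive rule: trust the empirical means as if they were exact."""
    means, _ = summarize(dataset)
    return max(means, key=means.get)


def pessimistic_choice(dataset, alpha=1.0):
    """Hypothetical pessimistic rule: maximize mean - alpha / sqrt(n).

    The penalty shrinks as an action gets more samples, so poorly covered
    actions are scored near their worst plausible value. Actions that never
    appear in the dataset are simply not candidates here (an assumption).
    """
    means, counts = summarize(dataset)
    lower = {a: means[a] - alpha / math.sqrt(counts[a]) for a in means}
    return max(lower, key=lower.get)


if __name__ == "__main__":
    # Action "a" is well covered (100 samples, mean 0.7); action "b" was tried
    # once and happened to return 1.0.
    data = [("a", 0.7)] * 100 + [("b", 1.0)]
    print(naive_choice(data), pessimistic_choice(data))
```

Running the script prints `b a`: the naive rule overestimates the barely observed action `b`, while the pessimistic rule scores `b` at 1.0 - 1/1 = 0.0 and `a` at 0.7 - 1/10 = 0.6, so it selects the action whose value the dataset actually supports.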

Updated: 2020-10-06