Greedy Multi-step Off-Policy Reinforcement Learning
arXiv - CS - Multiagent Systems · Pub Date: 2021-02-23 · DOI: arxiv-2102.11717 · Yuhui Wang, Pengcheng He, Xiaoyang Tan
Multi-step off-policy reinforcement learning has achieved great success.
However, existing multi-step methods usually impose a fixed prior on the
bootstrap step, while off-policy methods often require additional
correction, which introduces undesired side effects. In this paper, we
propose a novel bootstrapping method that greedily takes the maximum among
the bootstrapping values computed with varying steps. The new method has
two desired properties: 1) it can flexibly adjust the bootstrap step based
on the quality of the data and the learned value function; 2) it can safely
and robustly utilize data from an arbitrary behavior policy without
additional correction, whatever its quality or "off-policyness". We analyze
the theoretical properties of the related operator, showing that it
converges to the globally optimal value function at a rate faster than that
of the traditional Bellman Optimality Operator. Furthermore, based on this
new operator, we derive new model-free RL algorithms, Greedy Multi-step
Q-Learning and Greedy Multi-step DQN. Experiments reveal that the proposed
methods are reliable, easy to implement, and achieve state-of-the-art
performance on a series of standard benchmarks.
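The core idea of the abstract — among the n-step bootstrap targets for n = 1..N, greedily pick the largest — can be sketched as follows. This is an illustrative reconstruction from the abstract alone, not the authors' reference implementation; the function name and tabular Q-value interface are assumptions.

```python
def greedy_multistep_target(rewards, bootstrap_values, gamma=0.99):
    """Greedy multi-step bootstrap target (illustrative sketch).

    rewards: rewards r_t, ..., r_{t+N-1} along a sampled trajectory.
    bootstrap_values: entry k (0-indexed) is max_a Q(s_{t+k+1}, a),
        i.e. the bootstrapped value after k+1 environment steps.
    Returns the maximum over n = 1..N of the n-step target:
        sum_{i=0}^{n-1} gamma^i * r_{t+i} + gamma^n * max_a Q(s_{t+n}, a).
    """
    best = float("-inf")
    partial_return = 0.0  # discounted sum of the first n rewards
    for n in range(1, len(rewards) + 1):
        partial_return += gamma ** (n - 1) * rewards[n - 1]
        candidate = partial_return + gamma ** n * bootstrap_values[n - 1]
        best = max(best, candidate)
    return best
```

Because the target is a maximum over candidate returns, low-quality (e.g. highly off-policy) tails simply lose the max to the shorter-horizon bootstrap, which is one intuition for why no importance-sampling correction is needed.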
Updated: 2021-02-24