Greedy Multi-step Off-Policy Reinforcement Learning
arXiv - CS - Multiagent Systems. Pub Date: 2021-02-23, DOI: arxiv-2102.11717
Yuhui Wang, Pengcheng He, Xiaoyang Tan

Multi-step off-policy reinforcement learning has achieved great success. However, existing multi-step methods usually impose a fixed prior on the bootstrap steps, while off-policy methods often require additional correction, suffering from certain undesired effects. In this paper, we propose a novel bootstrapping method, which greedily takes the maximum among the bootstrapping values computed with varying steps. The new method has two desired properties: 1) it can flexibly adjust the bootstrap step based on the quality of the data and the learned value function; 2) it can safely and robustly utilize data from an arbitrary behavior policy without additional correction, whatever its quality or "off-policyness". We analyze the theoretical properties of the related operator, showing that it converges to the globally optimal value function at a rate faster than that of the traditional Bellman Optimality Operator. Furthermore, based on this new operator, we derive new model-free RL algorithms named Greedy Multi-Step Q Learning (and Greedy Multi-step DQN). Experiments reveal that the proposed methods are reliable, easy to implement, and achieve state-of-the-art performance on a series of standard benchmarks.
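To make the bootstrapping rule concrete, the following is a minimal Python sketch of the target described in the abstract: among the n-step bootstrapped returns computed along a sampled trajectory segment, take the maximum. This is an illustration inferred from the abstract, not the authors' implementation; the function name, its arguments, and the toy numbers are assumptions, and bootstrap_qs[n] is assumed to hold the greedy value max_a Q(s_{n+1}, a) at the state reached after n+1 steps.

    def greedy_multistep_target(rewards, bootstrap_qs, gamma=0.99):
        # rewards:      [r_0, ..., r_{N-1}] along a sampled (possibly off-policy) segment
        # bootstrap_qs: bootstrap_qs[n] = max_a Q(s_{n+1}, a), the greedy bootstrap value
        #               at the state reached after n+1 steps
        best = float("-inf")
        ret = 0.0
        for n, (r, v) in enumerate(zip(rewards, bootstrap_qs)):
            ret += (gamma ** n) * r                          # accumulated discounted rewards
            candidate = ret + (gamma ** (n + 1)) * v         # (n+1)-step bootstrapped value
            best = max(best, candidate)                      # greedily keep the largest target
        return best

    # Toy usage with made-up numbers: a 3-step segment.
    # Candidates are the 1-, 2-, and 3-step targets; the largest one is returned.
    print(greedy_multistep_target([1.0, 0.0, 2.0], [5.0, 4.0, 3.0], gamma=0.9))

Because the maximization is over bootstrap depths rather than over importance-weighted corrections, data from any behavior policy can be plugged in directly, which is the property the abstract highlights.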

Updated: 2021-02-24