Greedy Multi-step Off-Policy Reinforcement Learning
arXiv - CS - Multiagent Systems · Pub Date: 2021-02-23 · DOI: arxiv-2102.11717 · Yuhui Wang, Pengcheng He, Xiaoyang Tan
Multi-step off-policy reinforcement learning has achieved great success.
However, existing multi-step methods usually impose a fixed prior on the
bootstrap step, while off-policy methods often require additional
correction, which introduces undesired side effects. In this paper, we
propose a novel bootstrapping method that greedily takes the maximum among
the bootstrapping values computed with varying steps. The new method has
two desired properties: 1) it can flexibly adjust the bootstrap step based
on the quality of the data and the learned value function; 2) it can safely
and robustly utilize data from an arbitrary behavior policy without
additional correction, whatever its quality or "off-policyness". We analyze
the theoretical properties of the related operator, showing that it
converges to the globally optimal value function at a rate faster than that
of the traditional Bellman Optimality Operator. Furthermore, based on this
new operator, we derive new model-free RL algorithms, Greedy Multi-step
Q-Learning and Greedy Multi-step DQN. Experiments reveal that the proposed
methods are reliable, easy to implement, and achieve state-of-the-art
performance on a series of standard benchmarks.
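The core idea of the abstract — among the n-step bootstrap targets for n = 1..N, greedily pick the largest — can be sketched as follows. This is an illustrative reconstruction from the abstract alone, not the authors' reference implementation; the function name and tabular Q-value interface are assumptions.

```python
def greedy_multistep_target(rewards, bootstrap_values, gamma=0.99):
    """Greedy multi-step bootstrap target (illustrative sketch).

    rewards: rewards r_t, ..., r_{t+N-1} along a sampled trajectory.
    bootstrap_values: entry k (0-indexed) is max_a Q(s_{t+k+1}, a),
        i.e. the bootstrapped value after k+1 environment steps.
    Returns the maximum over n = 1..N of the n-step target:
        sum_{i=0}^{n-1} gamma^i * r_{t+i} + gamma^n * max_a Q(s_{t+n}, a).
    """
    best = float("-inf")
    partial_return = 0.0  # discounted sum of the first n rewards
    for n in range(1, len(rewards) + 1):
        partial_return += gamma ** (n - 1) * rewards[n - 1]
        candidate = partial_return + gamma ** n * bootstrap_values[n - 1]
        best = max(best, candidate)
    return best
```

Because the target is a maximum over candidate returns, low-quality (e.g. highly off-policy) tails simply lose the max to the shorter-horizon bootstrap, which is one intuition for why no importance-sampling correction is needed.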
Updated: 2021-02-24