Stratified Experience Replay: Correcting Multiplicity Bias in Off-Policy Reinforcement Learning
arXiv - CS - Artificial Intelligence Pub Date : 2021-02-22 , DOI: arxiv-2102.11319 Brett Daley, Cameron Hickert, Christopher Amato
Deep Reinforcement Learning (RL) methods rely on experience replay to
approximate the minibatched supervised learning setting; however, unlike
supervised learning where access to lots of training data is crucial to
generalization, replay-based deep RL appears to struggle in the presence of
extraneous data. Recent works have shown that the performance of Deep Q-Network
(DQN) degrades when its replay memory becomes too large. This suggests that outdated experiences somehow impact the performance of
deep RL, which should not be the case for off-policy methods like DQN.
Consequently, we re-examine the motivation for sampling uniformly over a replay
memory, and find that it may be flawed when using function approximation. We
show that -- despite conventional wisdom -- sampling from the uniform
distribution does not yield uncorrelated training samples and therefore biases
gradients during training. Our theory prescribes a special non-uniform
distribution to cancel this effect, and we propose a stratified sampling scheme
to efficiently implement it.
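The stratified sampling scheme described in the abstract could be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name, the choice of the state itself as the stratum key, and the transition tuple format are all assumptions made for the example. Grouping stored transitions into strata and sampling strata uniformly (then sampling uniformly within a stratum) down-weights experiences that appear many times in the buffer, approximating the inverse-multiplicity correction the paper prescribes.

```python
import random
from collections import defaultdict


class StratifiedReplayBuffer:
    """Sketch of stratified experience replay.

    Transitions are grouped into strata keyed by state (assumed hashable
    here for simplicity). Sampling first picks a stratum uniformly, then
    a transition within that stratum uniformly, so a state stored with
    high multiplicity is not over-represented in minibatches.
    """

    def __init__(self):
        # stratum key (state) -> list of transitions stored under it
        self.strata = defaultdict(list)

    def add(self, state, action, reward, next_state, done):
        self.strata[state].append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        keys = list(self.strata.keys())
        batch = []
        for _ in range(batch_size):
            k = random.choice(keys)                      # uniform over strata
            batch.append(random.choice(self.strata[k]))  # uniform within stratum
        return batch


buffer = StratifiedReplayBuffer()
buffer.add(0, 1, 0.5, 1, False)
buffer.add(0, 2, 0.0, 2, False)  # state 0 now has multiplicity 2
buffer.add(3, 0, 1.0, 4, True)
minibatch = buffer.sample(4)
```

Under uniform replay sampling, state 0 above would be drawn twice as often as state 3; the stratified scheme gives each stratum equal probability regardless of its multiplicity. A practical version would also need a hashing or discretization scheme for continuous states, which the sketch leaves out.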
Updated: 2021-02-24