Effects of sparse rewards of different magnitudes in the speed of learning of model-based actor critic methods
arXiv - CS - Robotics. Pub Date: 2020-01-18, DOI: arXiv:2001.06725
Juan Vargas, Lazar Andjelic, Amir Barati Farimani

Actor-critic methods with sparse rewards in model-based deep reinforcement learning typically require a deterministic binary reward function that reflects only two possible outcomes: whether, at each step, the goal has been achieved or not. Our hypothesis is that we can make an agent learn faster by applying external environmental pressure during training that adversely impacts its ability to collect higher rewards. We therefore deviate from the classical paradigm of sparse rewards and add a uniformly sampled reward value to the baseline reward to show that (1) the sample efficiency of the training process can be correlated with the adversity experienced during training, (2) it is possible to achieve higher performance in less time and with fewer resources, (3) the seed-to-seed performance variability can be reduced, (4) there is a maximum point beyond which more pressure does not generate better results, and (5) random positive incentives have an adverse effect when using a negative reward strategy, making an agent under those conditions learn poorly and more slowly. These results have been shown to hold for Deep Deterministic Policy Gradients using Hindsight Experience Replay in a well-known MuJoCo environment, but we argue that they could be generalized to other methods and environments as well.
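
The reward-shaping idea described in the abstract is straightforward to reproduce. Below is a minimal sketch, assuming an old-style Gym goal environment whose sparse baseline reward is -1 per step until success and 0 on success; the wrapper name, the uniform-sampling bounds, and the FetchReach-v1 example are illustrative assumptions, not the authors' implementation.

    import numpy as np
    import gym


    class UniformPressureWrapper(gym.Wrapper):
        """Adds a uniformly sampled value to a sparse baseline reward at every step."""

        def __init__(self, env, low=-1.0, high=0.0):
            super().__init__(env)
            self.low = low    # hypothetical bounds; the paper studies different magnitudes
            self.high = high

        def step(self, action):
            # Old-style Gym API (4-tuple); adjust for gymnasium's 5-tuple if needed.
            obs, reward, done, info = self.env.step(action)
            # Baseline sparse reward (e.g. -1 until success, 0 on success) plus
            # a uniformly sampled "environmental pressure" term.
            reward = reward + np.random.uniform(self.low, self.high)
            return obs, reward, done, info


    # Illustrative usage with a goal-based MuJoCo task and a HER-capable DDPG agent:
    # env = UniformPressureWrapper(gym.make("FetchReach-v1"), low=-1.0, high=0.0)

A wrapper of this kind keeps the underlying success criterion untouched while shifting every step's reward by a random amount, which is one simple way to realize the "external pressure" the abstract describes.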

Last updated: 2020-01-22