Sampling Rate Decay in Hindsight Experience Replay for Robot Control
IEEE Transactions on Cybernetics (IF 11.8) Pub Date: 2020-05-21, DOI: 10.1109/tcyb.2020.2990722
Luiz Felipe Vecchietti, Minah Seo, Dongsoo Har

Training agents via deep reinforcement learning with sparse rewards for robotic control tasks in a vast state space is a major challenge, owing to the rareness of successful experience. To address this problem, recent breakthrough methods, hindsight experience replay (HER) and aggressive rewards to counter bias in HER (ARCHER), take unsuccessful experiences and treat them as successful experiences achieving different goals, i.e., hindsight experiences. In these methods, hindsight experience is used at a fixed sampling rate during training. However, this usage of hindsight experience introduces bias, owing to a distinct optimal policy, and does not allow the hindsight experience to take variable importance at different stages of training. In this article, we investigate the impact of a variable sampling rate, representing the variable rate of hindsight experience, on training performance, and propose a sampling rate decay strategy that decreases the number of hindsight experiences as training proceeds. The proposed method is validated with three robotic control tasks included in the OpenAI Gym suite. The experimental results demonstrate that the proposed method achieves improved training performance and increased convergence speed over HER and ARCHER on two of the three tasks, and comparable training performance and convergence speed on the third.
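To make the core idea concrete, below is a minimal sketch of a sampling rate decay applied to HER-style goal relabeling. The linear schedule, the endpoint probabilities (`p_start`, `p_end`), and the transition layout (a dict carrying the episode's achieved goals) are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def hindsight_rate(epoch, total_epochs, p_start=0.8, p_end=0.0):
    # Linearly decay the probability of replaying a hindsight
    # (goal-relabeled) transition from p_start to p_end as training
    # proceeds. The linear shape and endpoints are illustrative
    # choices, not necessarily the paper's schedule.
    frac = min(max(epoch / total_epochs, 0.0), 1.0)
    return p_start + (p_end - p_start) * frac

def relabel_minibatch(transitions, epoch, total_epochs):
    # For each sampled transition, substitute an achieved goal from a
    # later step of the same episode (HER's "future" strategy) with
    # the decayed probability; otherwise keep the original goal.
    # Each transition is assumed to be a dict with keys "goal", "t",
    # and "achieved_goals" (the episode's achieved goals) -- a
    # hypothetical layout for this sketch.
    p = hindsight_rate(epoch, total_epochs)
    batch = []
    for tr in transitions:
        goal = tr["goal"]
        if rng.random() < p:
            # Pick a goal that was actually achieved at a later
            # timestep of the same episode (hindsight relabeling).
            future_t = rng.integers(tr["t"], len(tr["achieved_goals"]))
            goal = tr["achieved_goals"][future_t]
        batch.append({**tr, "goal": goal})
    return batch
```

Early in training, most replayed transitions are relabeled and thus carry a nonzero reward signal; as the decay progresses, the agent trains increasingly on transitions with the original goal, reducing the bias that fixed-rate hindsight replay introduces.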

Updated: 2020-05-21