Bias-reduced multi-step hindsight experience replay
arXiv - CS - Robotics Pub Date: 2021-02-25, DOI: arxiv-2102.12962
Rui Yang, Jiafei Lyu, Yu Yang, Jiangpeng Yan, Feng Luo, Dijun Luo, Lanqing Li, Xiu Li

Multi-goal reinforcement learning is widely used in planning and robot manipulation. Two main challenges in multi-goal reinforcement learning are sparse rewards and sample inefficiency. Hindsight Experience Replay (HER) aims to tackle both challenges with hindsight knowledge. However, HER and its previous variants still require millions of samples and substantial computation. In this paper, we propose \emph{Multi-step Hindsight Experience Replay} (MHER) based on $n$-step relabeling, incorporating multi-step relabeled returns to improve sample efficiency. Despite the advantages of $n$-step relabeling, we theoretically and experimentally prove that the off-policy $n$-step bias introduced by $n$-step relabeling may lead to poor performance in many environments. To address this issue, two bias-reduced MHER algorithms, MHER($\lambda$) and Model-based MHER (MMHER), are presented. MHER($\lambda$) exploits the $\lambda$-return, while MMHER benefits from model-based value expansions. Experimental results on numerous multi-goal robotic tasks show that our solutions successfully alleviate off-policy $n$-step bias and achieve significantly higher sample efficiency than HER and Curriculum-guided HER, with little additional computation beyond HER.
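To make the $n$-step relabeling idea concrete, the following is a minimal sketch (not the authors' code) of how an $n$-step relabeled target could be computed from buffered transitions. The names `reward_fn`, `q_fn`, and `pi_fn` are assumed stand-ins for a goal-conditioned reward function, the critic, and the current policy, none of which are specified in the abstract.

```python
def n_step_relabeled_target(transitions, goal, reward_fn, q_fn, pi_fn, gamma=0.98):
    """Sketch of an n-step relabeled return for a hindsight goal.

    transitions: length-n list of (state, action, next_state) tuples drawn
    from the replay buffer. goal: a relabeled (hindsight) goal, e.g. a state
    achieved later in the same episode. reward_fn/q_fn/pi_fn are assumed
    interfaces, not the paper's actual API.
    """
    n = len(transitions)
    target = 0.0
    for i, (state, action, next_state) in enumerate(transitions):
        # Rewards are recomputed under the relabeled goal, not the goal
        # originally pursued when the data was collected.
        target += gamma ** i * reward_fn(next_state, goal)
    # Bootstrap from the n-th state with the current policy's action. Because
    # the n buffered actions came from older policies, this target carries
    # off-policy n-step bias -- the issue MHER(lambda) and MMHER address.
    last_state = transitions[-1][2]
    target += gamma ** n * q_fn(last_state, pi_fn(last_state, goal), goal)
    return target
```

Under this reading, MHER($\lambda$) would trade bias against return horizon by blending such targets in the standard truncated $\lambda$-return form, $R^{\lambda} = (1-\lambda)\sum_{n=1}^{N-1}\lambda^{n-1}R_n + \lambda^{N-1}R_N$, while MMHER would obtain the multi-step rollout from a learned dynamics model rather than from buffered actions.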

Updated: 2021-02-26