Curious Exploration and Return-based Memory Restoration for Deep Reinforcement Learning
arXiv - CS - Robotics Pub Date: 2021-05-02, DOI: arxiv-2105.00499
Saeed Tafazzol, Erfan Fathi, Mahdi Rezaei, Ehsan Asali

Reward engineering and designing an incentive reward function are non-trivial tasks for training agents in complex environments. Moreover, an inaccurate reward function can induce biased behaviour that is far from efficient or optimised. In this paper, we focus on training a single agent to score goals with a binary success/failure reward function in the Half Field Offense domain. A major advantage of this research is that the agent makes no presumptions about the environment, meaning it follows only the original formulation of reinforcement learning. The main challenge of using such a reward function is the high sparsity of positive reward signals. To address this problem, we use a simple prediction-based exploration strategy (called Curious Exploration) together with a Return-based Memory Restoration (RMR) technique, which tends to retain the more valuable memories. The proposed method can be used to train agents in environments with fairly complex state and action spaces. Our experimental results show that many recent solutions, including our baseline method, fail to learn and perform in the complex soccer domain, whereas the proposed method converges easily to nearly optimal behaviour. A video presenting the performance of our trained agent is available at http://bit.ly/HFO_Binary_Reward.
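The exploration component described above is prediction-based: the agent earns an intrinsic bonus proportional to how poorly a learned forward-dynamics model predicts the next state, so novel states remain attractive even when the binary goal reward never fires. Below is a minimal sketch of that idea; the class name `CuriosityModule`, the linear forward model, and the hyperparameters are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

class CuriosityModule:
    """Prediction-based intrinsic reward (hypothetical sketch).

    A simple linear forward-dynamics model predicts the next state from
    (state, action); its squared prediction error is returned as a bonus,
    so poorly predicted (novel) transitions yield larger rewards.
    """

    def __init__(self, state_dim, action_dim, lr=1e-3, beta=0.1):
        rng = np.random.default_rng(0)
        # Forward model: next_state ~= W @ [state; action]
        self.W = rng.normal(0.0, 0.01, size=(state_dim, state_dim + action_dim))
        self.lr = lr      # SGD step size for the forward model
        self.beta = beta  # scale of the intrinsic bonus

    def intrinsic_reward(self, state, action, next_state):
        x = np.concatenate([state, action])
        error = next_state - self.W @ x
        # One gradient-descent step on 0.5 * ||next_state - W x||^2.
        self.W += self.lr * np.outer(error, x)
        return self.beta * float(error @ error)

# Usage (assumed training loop): augment the sparse environment reward
# with the curiosity bonus before storing the transition.
# r_total = r_env + curiosity.intrinsic_reward(s, a, s_next)
```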
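RMR is described here only as a technique that "tends to retain the more valuable memories". One plausible reading, sketched below under that assumption, is a replay buffer that tags each transition with its episodic return and evicts low-return entries first, so that rare goal-scoring episodes survive longer in memory. `RMRReplayBuffer` and its eviction rule are hypothetical, not the paper's exact mechanism.

```python
import random

class RMRReplayBuffer:
    """Return-weighted replay buffer (hypothetical sketch of RMR).

    Each transition is stored with the return of its episode. When the
    buffer is full, the entry with the lowest return is evicted, unless
    the incoming memory is even less valuable.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []  # list of (transition, episode_return)

    def add_episode(self, transitions, episode_return):
        for t in transitions:
            if len(self.buffer) >= self.capacity:
                # Find the least valuable stored memory.
                worst = min(range(len(self.buffer)),
                            key=lambda i: self.buffer[i][1])
                if self.buffer[worst][1] > episode_return:
                    # Incoming memory is less valuable; skip storing it.
                    continue
                self.buffer.pop(worst)
            self.buffer.append((t, episode_return))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        return [t for t, _ in batch]
```

With a binary success/failure reward, almost all episodes have a return of zero, so under this reading the few successful trajectories are effectively pinned in the buffer rather than being overwritten by the FIFO churn of a standard replay memory.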

Updated: 2021-05-04