FRESH: Interactive Reward Shaping in High-Dimensional State Spaces using Human Feedback
arXiv - CS - Artificial Intelligence Pub Date : 2020-01-19 , DOI: arxiv-2001.06781
Baicen Xiao, Qifan Lu, Bhaskar Ramasubramanian, Andrew Clark, Linda Bushnell, Radha Poovendran

Reinforcement learning has been successful in training autonomous agents to accomplish goals in complex environments. Although this has been adapted to multiple settings, including robotics and computer games, human players often find it easier to obtain higher rewards in some environments than reinforcement learning algorithms. This is especially true of high-dimensional state spaces where the reward obtained by the agent is sparse or extremely delayed. In this paper, we seek to effectively integrate feedback signals supplied by a human operator with deep reinforcement learning algorithms in high-dimensional state spaces. We call this FRESH (Feedback-based REward SHaping). During training, a human operator is presented with trajectories from a replay buffer and then provides feedback on states and actions in the trajectory. In order to generalize feedback signals provided by the human operator to previously unseen states and actions at test-time, we use a feedback neural network. We use an ensemble of neural networks with a shared network architecture to represent model uncertainty and the confidence of the neural network in its output. The output of the feedback neural network is converted to a shaping reward that is added to the reward provided by the environment. We evaluate our approach on the Bowling and Skiing Atari games in the Arcade Learning Environment. Although human experts have been able to achieve high scores in these environments, state-of-the-art deep learning algorithms perform poorly. We observe that FRESH is able to achieve much higher scores than state-of-the-art deep learning algorithms in both environments. FRESH also achieves a 21.4% higher score than a human expert in Bowling and does as well as a human expert in Skiing.
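The abstract describes an ensemble feedback network whose averaged prediction, weighted by ensemble agreement, is converted into a shaping reward and added to the environment reward. The sketch below illustrates that idea only; it is not the authors' implementation. The network sizes, the two-class feedback labels, the agreement-based confidence measure, and the shaping scale and threshold are illustrative assumptions.

    # Minimal sketch of feedback-based reward shaping with an ensemble feedback
    # network, assuming binary {negative, positive} human-feedback labels.
    import torch
    import torch.nn as nn


    class FeedbackEnsemble(nn.Module):
        """Ensemble of MLPs with a shared architecture; disagreement across
        members is used as a proxy for model uncertainty (an assumption here)."""

        def __init__(self, state_dim: int, action_dim: int, n_members: int = 5):
            super().__init__()
            self.members = nn.ModuleList([
                nn.Sequential(
                    nn.Linear(state_dim + action_dim, 64),
                    nn.ReLU(),
                    nn.Linear(64, 2),  # logits for {negative, positive} feedback
                )
                for _ in range(n_members)
            ])

        def forward(self, state: torch.Tensor, action: torch.Tensor):
            x = torch.cat([state, action], dim=-1)
            # Per-member class probabilities, stacked along a new ensemble axis.
            probs = torch.stack([torch.softmax(m(x), dim=-1) for m in self.members])
            mean_probs = probs.mean(dim=0)                      # ensemble prediction
            confidence = 1.0 - probs.std(dim=0).mean(dim=-1)    # agreement -> confidence
            return mean_probs, confidence


    def shaped_reward(env_reward: torch.Tensor,
                      mean_probs: torch.Tensor,
                      confidence: torch.Tensor,
                      scale: float = 1.0,
                      conf_threshold: float = 0.9) -> torch.Tensor:
        """Add a feedback-based shaping term to the environment reward:
        +scale for predicted positive feedback, -scale for predicted negative
        feedback, applied only when the ensemble is sufficiently confident."""
        positive = mean_probs[..., 1] > mean_probs[..., 0]
        sign = torch.where(positive,
                           torch.ones_like(confidence),
                           -torch.ones_like(confidence))
        shaping = torch.where(confidence > conf_threshold,
                              scale * sign,
                              torch.zeros_like(sign))
        return env_reward + shaping


    if __name__ == "__main__":
        ensemble = FeedbackEnsemble(state_dim=8, action_dim=4)
        s, a = torch.randn(1, 8), torch.randn(1, 4)
        mean_probs, conf = ensemble(s, a)
        print(shaped_reward(torch.tensor([0.0]), mean_probs, conf))

In this sketch the feedback network would be trained on the operator's labels for replay-buffer trajectories and then queried on unseen states and actions, with the confidence gate suppressing the shaping term where the ensemble disagrees.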

Updated: 2020-01-22