Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution
arXiv - CS - Machine Learning, Pub Date: 2020-09-29, DOI: arxiv-2009.14108
Vihang P. Patil, Markus Hofmarcher, Marius-Constantin Dinu, Matthias Dorfer, Patrick M. Blies, Johannes Brandstetter, Jose A. Arjona-Medina, Sepp Hochreiter

Reinforcement Learning algorithms require a large number of samples to solve complex tasks with sparse and delayed rewards. Complex tasks can often be hierarchically decomposed into sub-tasks. A step in the Q-function can be associated with solving a sub-task, where the expectation of the return increases. RUDDER was introduced to identify these steps and then redistribute reward to them, thus giving reward immediately when a sub-task is solved. Since the problem of delayed rewards is mitigated, learning is considerably sped up. However, for complex tasks, the exploration strategies deployed in RUDDER struggle to discover episodes with high rewards. Therefore, we assume that episodes with high rewards are given as demonstrations and do not have to be discovered by exploration. Typically, the number of demonstrations is small, and RUDDER's LSTM model, as a deep learning method, does not learn well from so few examples. Hence, we introduce Align-RUDDER, which is RUDDER with two major modifications. First, Align-RUDDER assumes that episodes with high rewards are given as demonstrations, replacing RUDDER's safe exploration and lessons replay buffer. Second, we replace RUDDER's LSTM model by a profile model obtained from a multiple sequence alignment of the demonstrations. As is known from bioinformatics, profile models can be constructed from as few as two demonstrations. Align-RUDDER inherits the concept of reward redistribution, which considerably reduces the delay of rewards and thus speeds up learning. Align-RUDDER outperforms competitors on complex artificial tasks with delayed rewards and few demonstrations. On the MineCraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently. GitHub: https://github.com/ml-jku/align-rudder, YouTube: https://youtu.be/HO-_8ZUl-UY
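To make the reward-redistribution idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation): it builds a crude position-frequency profile from two toy demonstration event sequences and redistributes an episode's delayed return to the steps where the alignment score of the episode prefix increases, i.e. where a sub-task appears to be solved. All names (build_profile, redistribute_reward, the event strings) are illustrative assumptions; the actual Align-RUDDER pipeline clusters state-action pairs into events and uses a proper multiple sequence alignment to obtain the profile model (see the linked GitHub repository).

```python
import numpy as np

# Hypothetical demonstration event sequences; in Align-RUDDER these would come
# from clustering state-action pairs of the given high-reward demonstrations.
demonstrations = [
    ["log", "planks", "stick", "crafting_table", "wooden_pickaxe"],
    ["log", "planks", "crafting_table", "stick", "wooden_pickaxe"],
]

def build_profile(demos):
    """Position-specific event frequencies: a toy stand-in for a bioinformatics
    profile model built from a multiple sequence alignment of demonstrations."""
    length = max(len(d) for d in demos)
    events = sorted({e for d in demos for e in d})
    index = {e: i for i, e in enumerate(events)}
    profile = np.zeros((length, len(events)))
    for d in demos:
        for pos, e in enumerate(d):
            profile[pos, index[e]] += 1.0
    profile /= len(demos)
    return profile, index

def alignment_score(prefix, profile, index):
    """Score how well a (partial) episode aligns with the profile."""
    score = 0.0
    for pos, e in enumerate(prefix[: profile.shape[0]]):
        if e in index:
            score += profile[pos, index[e]]
    return score

def redistribute_reward(episode_events, episode_return, profile, index):
    """Redistribute the delayed episodic return to the steps where the
    alignment score of the episode prefix increases (sub-task solved)."""
    scores = [alignment_score(episode_events[: t + 1], profile, index)
              for t in range(len(episode_events))]
    deltas = np.clip(np.diff([0.0] + scores), 0.0, None)
    if deltas.sum() == 0.0:
        return np.zeros(len(episode_events))
    return episode_return * deltas / deltas.sum()

profile, index = build_profile(demonstrations)
episode = ["log", "dirt", "planks", "stick", "crafting_table", "wooden_pickaxe"]
print(redistribute_reward(episode, 1.0, profile, index))
```

In this sketch the single delayed return of 1.0 is spread over the steps that advance along the demonstrated sub-task sequence, while the off-profile "dirt" step receives nothing, which illustrates why redistributing reward in this way shortens the effective reward delay.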

Updated: 2020-09-30