Solving Sokoban with backward reinforcement learning
arXiv - CS - Artificial Intelligence, Pub Date: 2021-05-05, DOI: arxiv-2105.01904
Yaron Shoham, Gal Elidan

In some puzzles, the strategy we need to use near the goal can be quite different from the strategy that is effective earlier on, e.g. due to a smaller branching factor near the exit state in a maze. A common approach in these cases is to apply both a forward and a backward search, and to try to align the two. In this work we propose an approach that takes this idea a step further, within a reinforcement learning (RL) framework. Training a traditional forward-looking agent using RL can be difficult because rewards are often sparse, e.g. given only at the goal. Instead, we first train a backward-looking agent with a simple relaxed goal. We then augment the state representation of the puzzle with straightforward hint features extracted from the behavior of that agent. Finally, we train a forward-looking agent on this informed, augmented state. We demonstrate that this simple "access" to partial backward plans leads to a substantial performance boost. On the challenging domain of the Sokoban puzzle, our RL approach substantially surpasses the best learned solvers that generalize over levels, and is competitive with the SOTA performance of the best highly-crafted solution. Impressively, we achieve these results while learning from only a small number of practice levels and using simple RL techniques.
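The abstract describes a three-step pipeline: train a backward agent under a relaxed goal, extract hint features from its behavior, and train a forward agent on the hint-augmented state. The sketch below illustrates that general idea with plain tabular Q-learning on a toy gridworld. It is an assumption-laden illustration, not the paper's implementation: the grid, the sparse reward, and the q_learn and hint helpers are invented stand-ins for the actual Sokoban state encoding, features, and learner.

import random
from collections import defaultdict

# Toy 5x5 gridworld standing in for a puzzle: start at (0, 0), goal at (4, 4).
# Illustrative sketch of the backward-then-forward idea only; the paper's
# Sokoban state representation, rewards, and learner all differ.
SIZE = 5
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up
GOAL = (SIZE - 1, SIZE - 1)
START = (0, 0)

def step(state, action):
    """Apply a move, clipping at the walls."""
    r = min(max(state[0] + action[0], 0), SIZE - 1)
    c = min(max(state[1] + action[1], 0), SIZE - 1)
    return (r, c)

def q_learn(start, goal, episodes=2000, hint=None, alpha=0.5, gamma=0.95, eps=0.2):
    """Tabular Q-learning with a sparse reward given only at the goal.
    If `hint` is provided, it maps a state to an extra feature that is
    folded into the state representation (the 'augmented state')."""
    q = defaultdict(float)
    key = (lambda s: (s, hint(s))) if hint else (lambda s: s)
    for _ in range(episodes):
        s = start
        for _ in range(50):  # cap episode length
            a = (random.randrange(len(ACTIONS)) if random.random() < eps
                 else max(range(len(ACTIONS)), key=lambda i: q[(key(s), i)]))
            s2 = step(s, ACTIONS[a])
            reward = 1.0 if s2 == goal else 0.0  # sparse: reward at the goal only
            target = reward + (0.0 if s2 == goal else
                               gamma * max(q[(key(s2), i)] for i in range(len(ACTIONS))))
            q[(key(s), a)] += alpha * (target - q[(key(s), a)])
            s = s2
            if s == goal:
                break
    return q, key

# 1) Backward agent: start *at the goal* and learn to reach the start under a
#    relaxed objective (here, plain reachability on the same grid).
back_q, back_key = q_learn(start=GOAL, goal=START)

# 2) Hint feature: the backward agent's greedy action at each state, a rough
#    signal of how goal-anchored partial plans pass through that state.
def hint(state):
    return max(range(len(ACTIONS)), key=lambda i: back_q[(back_key(state), i)])

# 3) Forward agent trained on the hint-augmented state.
fwd_q, fwd_key = q_learn(start=START, goal=GOAL, hint=hint)
print("learned", len(fwd_q), "augmented state-action values")

In this toy version the hint is just the backward policy's preferred action, concatenated with the raw state; the design point it mirrors is that the forward learner never needs the backward plan to be complete or correct, only cheaply available as extra input features.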

Updated: 2021-05-06