Reward Shaping for Reinforcement Learning with Omega-Regular Objectives
arXiv - CS - Logic in Computer Science. Pub Date: 2020-01-16, DOI: arxiv-2001.05977
E. M. Hahn, M. Perez, S. Schewe, F. Somenzi, A. Trivedi, D. Wojtczak

Recently, good-for-MDPs automata (Büchi automata with a restricted form of nondeterminism) have been successfully exploited for model-free reinforcement learning; this class of automata subsumes good-for-games automata and the most widespread class of limit-deterministic automata. The foundation of using these Büchi automata is that, for good-for-MDPs automata, the Büchi condition can be translated to reachability. The drawback of this translation is that the rewards are, on average, reaped very late, which requires long episodes during the learning process. We devise a new reward shaping approach that overcomes this issue. We show that the resulting model is equivalent to a discounted payoff objective with a biased discount that simplifies and improves on prior work in this direction.
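To make the idea concrete, below is a minimal illustrative sketch (not the authors' implementation) of the "biased discount" reading of the abstract: tabular Q-learning on a small hand-built product MDP in which accepting transitions of the good-for-MDPs Büchi automaton pay an immediate reward and apply a separate discount factor, while all other transitions pay nothing. The toy MDP, the constants GAMMA and GAMMA_B, and all function names are assumptions introduced only for illustration.

```python
import random
from collections import defaultdict

# Hypothetical toy product MDP, composed offline: each state stands for a pair
# (MDP state, automaton state).  The boolean flag marks transitions that take
# an accepting edge of the good-for-MDPs Buechi automaton.  This structure is
# illustrative and not taken from the paper.
transitions = {
    # state: {action: [(next_state, probability, accepting?), ...]}
    "s0": {"a": [("s1", 0.9, False), ("s0", 0.1, False)],
           "b": [("s2", 1.0, False)]},
    "s1": {"a": [("s0", 1.0, True)]},   # accepting edge on the a-loop
    "s2": {"a": [("s2", 1.0, False)]},  # rejecting sink
}

GAMMA = 0.99   # near-undiscounted factor on non-accepting steps (assumed)
GAMMA_B = 0.9  # biased discount applied only on accepting transitions (assumed)
ALPHA = 0.1    # learning rate
EPSILON = 0.1  # exploration rate

Q = defaultdict(float)

def step(state, action):
    """Sample a successor of (state, action) in the toy product MDP."""
    succs = transitions[state][action]
    r, acc = random.random(), 0.0
    for nxt, p, accepting in succs:
        acc += p
        if r <= acc:
            return nxt, accepting
    return succs[-1][0], succs[-1][2]

def shaped_update(state, action, next_state, accepting):
    """Reward-shaping sketch: an accepting transition pays 1 - GAMMA_B right
    away and discounts the future by GAMMA_B; other transitions pay nothing
    and are (almost) undiscounted.  Constants are for illustration only."""
    reward = (1.0 - GAMMA_B) if accepting else 0.0
    discount = GAMMA_B if accepting else GAMMA
    best_next = max(Q[(next_state, a)] for a in transitions[next_state])
    target = reward + discount * best_next
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])

def run(episodes=2000, horizon=50):
    for _ in range(episodes):
        state = "s0"
        for _ in range(horizon):
            actions = list(transitions[state])
            if random.random() < EPSILON:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, accepting = step(state, action)
            shaped_update(state, action, next_state, accepting)
            state = next_state

if __name__ == "__main__":
    run()
    for s in transitions:
        best = max(transitions[s], key=lambda a: Q[(s, a)])
        print(s, best, round(Q[(s, best)], 3))
```

In this sketch the agent is paid as soon as it traverses an accepting edge rather than only upon reaching a terminal accepting sink, which is the intuition behind reaping rewards earlier during learning; how the paper formally relates this to the reachability translation is beyond what the abstract states.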

Updated: 2020-01-17