Combining No-regret and Q-learning
arXiv - CS - Multiagent Systems. Pub Date: 2019-10-07. DOI: arxiv-1910.03094. Ian A. Kash, Michael Sullins, Katja Hofmann
Counterfactual Regret Minimization (CFR) has found success in settings like
poker which have both terminal states and perfect recall. We seek to understand
how to relax these requirements. As a first step, we introduce a simple
algorithm, local no-regret learning (LONR), which uses a Q-learning-like update
rule to allow learning without terminal states or perfect recall. We prove its
convergence for the basic case of MDPs (and limited extensions of them) and
present empirical results showing that it achieves last iterate convergence in
a number of settings, most notably NoSDE games, a class of Markov games
specifically designed to be challenging to learn, for which no prior algorithm
is known to converge to a stationary equilibrium even on average.
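The abstract describes LONR as pairing a Q-learning-like backup with local no-regret learning at each state. The following is a minimal sketch of that idea on a made-up two-state MDP: each state runs its own regret-matching learner whose payoffs are value-iteration-style backups of the Q-values. The MDP, discount factor, iteration count, and the use of synchronous expected backups are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Toy 2-state, 2-action MDP (transitions and rewards are made up).
n_states, n_actions, gamma = 2, 2, 0.9
# P[s, a, s'] : transition probabilities; R[s, a] : immediate rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

Q = np.zeros((n_states, n_actions))
cum_regret = np.zeros((n_states, n_actions))

def regret_matching(regrets):
    """Play proportionally to positive cumulative regret (uniform if none)."""
    pos = np.maximum(regrets, 0.0)
    total = pos.sum()
    return pos / total if total > 0 else np.full(len(regrets), 1.0 / len(regrets))

for _ in range(500):
    # Local policy at every state from its regret-matching learner.
    pi = np.array([regret_matching(cum_regret[s]) for s in range(n_states)])
    # Value of each state under the current local policies.
    V = (pi * Q).sum(axis=1)
    # Q-learning-like backup for every (state, action) pair:
    # bootstrap on the next state's policy value instead of a terminal payoff.
    Q_new = R + gamma * P @ V
    # Feed the backed-up values to each state's no-regret learner.
    for s in range(n_states):
        cum_regret[s] += Q_new[s] - pi[s] @ Q_new[s]
    Q = Q_new

pi = np.array([regret_matching(cum_regret[s]) for s in range(n_states)])
print(np.round(pi, 3))  # per-state policies after training
```

In this single-agent sketch the local learners simply recover a good stationary policy; the point of the construction is that the same update rule is defined without terminal states or perfect recall, which is what LONR exploits in the Markov-game settings the paper studies.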
Updated: 2020-03-18