Combining No-regret and Q-learning
arXiv - CS - Multiagent Systems. Pub Date: 2019-10-07. DOI: arxiv-1910.03094. Ian A. Kash, Michael Sullins, Katja Hofmann
Counterfactual Regret Minimization (CFR) has found success in settings like
poker which have both terminal states and perfect recall. We seek to understand
how to relax these requirements. As a first step, we introduce a simple
algorithm, local no-regret learning (LONR), which uses a Q-learning-like update
rule to allow learning without terminal states or perfect recall. We prove its
convergence for the basic case of MDPs (and limited extensions of them) and
present empirical results showing that it achieves last iterate convergence in
a number of settings, most notably NoSDE games, a class of Markov games
specifically designed to be challenging to learn, for which no prior algorithm
is known to converge to a stationary equilibrium even on average.
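The abstract describes LONR as pairing a Q-learning-like backup with local no-regret learning at each state. The following is a minimal sketch of that idea on a made-up two-state MDP: each state runs its own regret-matching learner whose payoffs are value-iteration-style backups of the Q-values. The MDP, discount factor, iteration count, and the use of synchronous expected backups are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Toy 2-state, 2-action MDP (transitions and rewards are made up).
n_states, n_actions, gamma = 2, 2, 0.9
# P[s, a, s'] : transition probabilities; R[s, a] : immediate rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

Q = np.zeros((n_states, n_actions))
cum_regret = np.zeros((n_states, n_actions))

def regret_matching(regrets):
    """Play proportionally to positive cumulative regret (uniform if none)."""
    pos = np.maximum(regrets, 0.0)
    total = pos.sum()
    return pos / total if total > 0 else np.full(len(regrets), 1.0 / len(regrets))

for _ in range(500):
    # Local policy at every state from its regret-matching learner.
    pi = np.array([regret_matching(cum_regret[s]) for s in range(n_states)])
    # Value of each state under the current local policies.
    V = (pi * Q).sum(axis=1)
    # Q-learning-like backup for every (state, action) pair:
    # bootstrap on the next state's policy value instead of a terminal payoff.
    Q_new = R + gamma * P @ V
    # Feed the backed-up values to each state's no-regret learner.
    for s in range(n_states):
        cum_regret[s] += Q_new[s] - pi[s] @ Q_new[s]
    Q = Q_new

pi = np.array([regret_matching(cum_regret[s]) for s in range(n_states)])
print(np.round(pi, 3))  # per-state policies after training
```

In this single-agent sketch the local learners simply recover a good stationary policy; the point of the construction is that the same update rule is defined without terminal states or perfect recall, which is what LONR exploits in the Markov-game settings the paper studies.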
Updated: 2020-03-18