Off-Policy Exploitability-Evaluation and Equilibrium-Learning in Two-Player Zero-Sum Markov Games
arXiv - CS - Computer Science and Game Theory. Pub Date: 2020-07-04. DOI: arxiv-2007.02141
Kenshi Abe, Yusuke Kaneko

Off-policy evaluation (OPE) is the problem of evaluating new policies using historical data obtained from a different policy. Off-policy learning (OPL), on the other hand, is the problem of finding an optimal policy using historical data. Most recent OPE and OPL studies have focused on the one-player case rather than on settings with two or more players. In this study, we propose methods for OPE and OPL in two-player zero-sum Markov games. For OPE, we estimate exploitability, a metric often used to measure how close a strategy profile is to a Nash equilibrium in two-player zero-sum games. For OPL, we compute maximin policies as Nash equilibrium strategies from the historical data. We prove exploitability-estimation error bounds for OPE and regret bounds for OPL based on the doubly robust and double reinforcement learning estimators. Finally, we demonstrate the effectiveness and performance of the proposed methods through experiments.
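
For reference, the abstract does not define the exploitability metric explicitly. A standard formulation in a two-player zero-sum game, writing v^1(\pi^1,\pi^2) for player 1's expected return under the strategy profile (\pi^1,\pi^2) (notation assumed here, not taken from the paper), is

\[
\mathrm{Exploit}(\pi^1,\pi^2) \;=\; \max_{\tilde{\pi}^1} v^1(\tilde{\pi}^1,\pi^2) \;-\; \min_{\tilde{\pi}^2} v^1(\pi^1,\tilde{\pi}^2) \;\ge\; 0,
\]

which is zero exactly when (\pi^1,\pi^2) is a Nash equilibrium; a small estimated exploitability therefore indicates a near-equilibrium profile.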

Updated: 2020-07-07