Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium
arXiv - CS - Computer Science and Game Theory. Pub Date: 2020-02-17, DOI: arxiv-2002.07066
Qiaomin Xie, Yudong Chen, Zhaoran Wang, Zhuoran Yang

We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games with simultaneous moves. To incorporate function approximation, we consider a family of Markov games where the reward function and transition kernel possess a linear structure. Both the offline and online settings of the problem are considered. In the offline setting, we control both players and aim to find the Nash Equilibrium by minimizing the duality gap. In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret. For both settings, we propose an optimistic variant of the least-squares minimax value iteration algorithm. We show that our algorithm is computationally efficient and provably achieves an $\tilde O(\sqrt{d^3 H^3 T})$ upper bound on the duality gap and regret, where $d$ is the linear dimension, $H$ the horizon and $T$ the total number of timesteps. Our results do not require additional assumptions on the sampling model. Our setting requires overcoming several new challenges that are absent in Markov decision processes or turn-based Markov games. In particular, to achieve optimism with simultaneous moves, we construct both upper and lower confidence bounds of the value function, and then compute the optimistic policy by solving a general-sum matrix game with these bounds as the payoff matrices. As finding the Nash Equilibrium of a general-sum game is computationally hard, our algorithm instead solves for a Coarse Correlated Equilibrium (CCE), which can be obtained efficiently. To the best of our knowledge, such a CCE-based scheme for optimism has not appeared in the literature and might be of interest in its own right.
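
To make the CCE subroutine described in the abstract concrete, below is a minimal Python sketch (not the authors' implementation) of how a coarse correlated equilibrium of a two-player general-sum matrix game can be computed as a linear feasibility program; in the algorithm, the upper and lower confidence bounds of the value function would supply the two payoff matrices. The function name and the SciPy-based LP formulation are illustrative assumptions, not taken from the paper.

import numpy as np
from scipy.optimize import linprog

def coarse_correlated_equilibrium(P, Q):
    # P, Q: (m, n) payoff matrices for the row and column player (both maximize).
    m, n = P.shape
    # Decision variable: a joint distribution pi over (a, b), flattened to length m*n.
    # Row-player CCE constraints: for every fixed deviation a',
    #   sum_{a,b} pi(a,b) * (P[a',b] - P[a,b]) <= 0, and symmetrically for Q.
    rows = [(P[a, :][None, :] - P).ravel() for a in range(m)]
    cols = [(Q[:, b][:, None] - Q).ravel() for b in range(n)]
    A_ub = np.array(rows + cols)
    b_ub = np.zeros(m + n)
    A_eq = np.ones((1, m * n))   # pi must sum to one
    b_eq = np.array([1.0])
    # Pure feasibility problem, so the objective is zero; any feasible point is a CCE.
    res = linprog(c=np.zeros(m * n), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (m * n), method="highs")
    return res.x.reshape(m, n)

# Rock-paper-scissors as a sanity check (zero-sum, so Q = -P); the uniform joint
# distribution is one valid CCE, and any returned point satisfies the constraints.
P = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
print(coarse_correlated_equilibrium(P, -P).round(3))

Because the constraints only forbid unilateral deviations to fixed actions, this is a plain LP rather than the harder fixed-point computation a Nash equilibrium would require, which is the computational point the abstract makes.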

Updated: 2020-06-25