Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games
arXiv - CS - Multiagent Systems. Pub Date: 2020-06-15, DOI: arxiv-2006.08555
Stephen McAleer, John Lanier, Roy Fox, Pierre Baldi

Finding approximate Nash equilibria in zero-sum imperfect-information games is challenging when the number of information states is large. Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm grounded in game theory that is guaranteed to converge to an approximate Nash equilibrium. However, PSRO requires training a reinforcement learning policy at each iteration, making it too slow for large games. We show through counterexamples and experiments that DCH and Rectified PSRO, two existing approaches to scaling up PSRO, fail to converge even in small games. We introduce Pipeline PSRO (P2SRO), the first scalable general method for finding approximate Nash equilibria in large zero-sum imperfect-information games. P2SRO is able to parallelize PSRO with convergence guarantees by maintaining a hierarchical pipeline of reinforcement learning workers, each training against the policies generated by lower levels in the hierarchy. We show that, unlike existing methods, P2SRO converges to an approximate Nash equilibrium, and does so faster as the number of parallel workers increases, across a variety of imperfect-information games. We also introduce an open-source environment for Barrage Stratego, a variant of Stratego with an approximate game tree complexity of $10^{50}$. P2SRO is able to achieve state-of-the-art performance on Barrage Stratego and beats all existing bots.
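To make the pipeline idea concrete, below is a minimal, schematic sketch of the structure the abstract describes, run on a tiny zero-sum matrix game (rock-paper-scissors) instead of a large imperfect-information game. It is not the authors' implementation: the fictitious-play meta-solver, the exact best-response step, and the fixed promotion schedule are simplified stand-ins I introduce for illustration; in P2SRO each level is a reinforcement learning worker and the lowest worker is promoted when its learning plateaus.

```python
# Schematic Pipeline-PSRO-style population loop on rock-paper-scissors.
# All helper names (approx_meta_nash, best_response, the promotion rule)
# are illustrative assumptions, not the paper's code.
import numpy as np

# Row player's payoff matrix for rock-paper-scissors (zero-sum, antisymmetric).
RPS = np.array([[ 0., -1.,  1.],
                [ 1.,  0., -1.],
                [-1.,  1.,  0.]])

def approx_meta_nash(payoffs, iters=2000):
    """Approximate Nash of the symmetric zero-sum meta-game via fictitious play."""
    n = payoffs.shape[0]
    counts = np.ones(n)
    for _ in range(iters):
        mix = counts / counts.sum()
        counts[np.argmax(payoffs @ mix)] += 1.0  # best respond to the empirical mix
    return counts / counts.sum()

def best_response(opponent_mix, policies, game=RPS):
    """Exact best response (one-hot pure strategy) to a mixture over `policies`."""
    opponent_strategy = sum(w * p for w, p in zip(opponent_mix, policies))
    return np.eye(game.shape[0])[np.argmax(game @ opponent_strategy)]

# Pipeline of "workers": level 0 is the lowest active level.
fixed_population = [np.array([1.0, 0.0, 0.0])]  # start with a single policy
num_workers = 2                                 # parallel pipeline depth

for iteration in range(6):
    # Each active worker trains against everything below it in the hierarchy:
    # the fixed population plus lower active workers. Here "training" is a
    # single exact best-response step instead of an RL inner loop.
    active = []
    for level in range(num_workers):
        below = fixed_population + active
        meta_payoffs = np.array([[p @ RPS @ q for q in below] for p in below])
        mix = approx_meta_nash(meta_payoffs)
        active.append(best_response(mix, below))
    # Promote the lowest active worker into the fixed population
    # (in P2SRO this happens once its learning plateaus).
    fixed_population.append(active[0])

# Mix the final population with the meta-Nash weights; for RPS this
# should approach the uniform strategy.
meta = np.array([[p @ RPS @ q for q in fixed_population] for p in fixed_population])
weights = approx_meta_nash(meta)
avg = sum(w * p for w, p in zip(weights, fixed_population))
print("approximate Nash mixture over population:", np.round(avg, 3))
```

Under these assumptions the printed mixture approaches (1/3, 1/3, 1/3), the Nash equilibrium of rock-paper-scissors; the point of the sketch is only the control flow, i.e. that all pipeline levels can run in parallel while only the lowest level's output is ever frozen into the population.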

Updated: 2020-06-16