Efficient Competitive Self-Play Policy Optimization
arXiv - CS - Computer Science and Game Theory. Pub Date: 2020-09-13, DOI: arxiv-2009.06086
Yuanyi Zhong, Yuan Zhou, Jian Peng

Reinforcement learning from self-play has recently achieved many successes. Self-play, where agents compete against themselves, is often used to generate training data for iterative policy improvement. In previous work, heuristic rules were designed to choose an opponent for the current learner; typical rules include choosing the latest agent, the best agent, or a random historical agent. However, these rules can be inefficient in practice and sometimes fail to guarantee convergence even in the simplest matrix games. In this paper, we propose a new algorithmic framework for competitive self-play reinforcement learning in two-player zero-sum games. We exploit the fact that the Nash equilibrium coincides with the saddle point of the stochastic payoff function, which motivates us to borrow ideas from the classical saddle-point optimization literature. Our method trains several agents simultaneously and intelligently pairs them against one another as opponents, following simple adversarial rules derived from a principled perturbation-based saddle-point optimization method. We prove theoretically that our algorithm converges to an approximate equilibrium with high probability in convex-concave games under standard assumptions. Beyond the theory, we further demonstrate the empirical superiority of our method, with neural-network policy function approximators, over baselines relying on the aforementioned opponent-selection heuristics in matrix games, grid-world soccer, Gomoku, and simulated robot sumo.
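To make the saddle-point view concrete, the sketch below is a minimal, illustrative example rather than the authors' implementation: in a two-player zero-sum matrix game the payoff is the bilinear form x^T A y, its saddle point is the Nash equilibrium, and a small population of strategies can be trained with projected gradient steps while each learner is paired with its currently most adversarial opponent in the population (a simple perturbation-style selection rule). The helper names project_simplex and self_play_saddle, the population size, and the step-size schedule are hypothetical choices for illustration, not quantities from the paper.

```python
# Illustrative sketch of population self-play as saddle-point optimization
# on a zero-sum matrix game (row player maximizes x^T A y, column player minimizes).
import numpy as np

def project_simplex(v):
    """Euclidean projection of a real vector v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def self_play_saddle(A, n_agents=4, steps=2000, lr=0.5, seed=0):
    """Train a small population of row/column strategies with projected gradient
    steps, pairing each learner with its most adversarial opponent at every step.
    Returns time-averaged strategies (the usual averaging for saddle problems)."""
    rng = np.random.default_rng(seed)
    k, m = A.shape
    X = np.array([project_simplex(rng.random(k)) for _ in range(n_agents)])  # row strategies
    Y = np.array([project_simplex(rng.random(m)) for _ in range(n_agents)])  # column strategies
    avgX, avgY = np.zeros_like(X), np.zeros_like(Y)
    for t in range(steps):
        eta = lr / np.sqrt(t + 1)
        pay = X @ A @ Y.T                    # pay[i, j] = payoff of row agent i vs column agent j
        opp_for_x = pay.argmin(axis=1)       # hardest column opponent for each row agent
        opp_for_y = pay.argmax(axis=0)       # hardest row opponent for each column agent
        newX = np.array([project_simplex(X[i] + eta * (A @ Y[opp_for_x[i]]))
                         for i in range(n_agents)])   # gradient ascent for the maximizer
        newY = np.array([project_simplex(Y[j] - eta * (A.T @ X[opp_for_y[j]]))
                         for j in range(n_agents)])   # gradient descent for the minimizer
        X, Y = newX, newY
        avgX += X
        avgY += Y
    return avgX / steps, avgY / steps

if __name__ == "__main__":
    # Rock-paper-scissors: the unique equilibrium is the uniform strategy (1/3, 1/3, 1/3).
    A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
    X, Y = self_play_saddle(A)
    print("averaged row strategies:\n", X.round(3))
    print("averaged column strategies:\n", Y.round(3))
```

In the paper itself the agents are neural-network policies trained with reinforcement learning rather than explicit simplex strategies; the pairing rule above only illustrates the idea of training each agent against the most adversarial member of the population.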

Updated: 2020-09-15