Temporal Induced Self-Play for Stochastic Bayesian Games
arXiv - CS - Computer Science and Game Theory. Pub Date: 2021-08-21. DOI: arxiv-2108.09444. Weizhe Chen, Zihan Zhou, Yi Wu, Fei Fang
One practical requirement in solving dynamic games is to ensure that the
players play well from any decision point onward. To satisfy this requirement,
existing efforts focus on equilibrium refinement, but the scalability and
applicability of existing techniques are limited. In this paper, we propose
Temporal-Induced Self-Play (TISP), a novel reinforcement learning-based
framework to find strategies with decent performance from any decision point
onward. TISP uses belief-space representation, backward induction, policy
learning, and non-parametric approximation. Building upon TISP, we design a
policy-gradient-based algorithm TISP-PG. We prove that TISP-based algorithms
can find approximate Perfect Bayesian Equilibrium in zero-sum one-sided
stochastic Bayesian games with finite horizon. We test TISP-based algorithms in
various games, including finitely repeated security games and a grid-world
game. The results show that TISP-PG is more scalable than existing mathematical
programming-based methods and significantly outperforms other learning-based
methods.
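The abstract's core recipe — discretize the belief space, perform backward induction from the final stage, learn a policy per belief point, and use non-parametric approximation to evaluate continuation values off-grid — can be illustrated on a toy game. The sketch below is not the paper's TISP-PG algorithm (which uses policy-gradient reinforcement learning); it replaces policy learning with an exact maximin search over a toy zero-sum one-sided Bayesian game of our own invention, where the attacker's private type flips the sign of the defender's payoff matrix. All names, payoffs, and the nearest-neighbour scheme are illustrative assumptions.

```python
import numpy as np

# Toy zero-sum one-sided Bayesian game: the attacker has a private type
# in {0, 1}; the defender only holds a belief b = P(type = 1).
# Per-stage payoff matrices for the defender (row player), one per type.
# These matrices are illustrative, not taken from the paper.
PAYOFF = {
    0: np.array([[1.0, -1.0], [-1.0, 1.0]]),   # vs. type 0
    1: np.array([[-1.0, 1.0], [1.0, -1.0]]),   # vs. type 1 (sign-flipped)
}
HORIZON = 3
BELIEF_GRID = np.linspace(0.0, 1.0, 11)        # discretized belief space

def stage_value(b, cont_value):
    """Defender's maximin value of one stage at belief b, with a
    (simplified, belief-independent) continuation value added on."""
    # Expected payoff matrix under the current belief.
    M = (1 - b) * PAYOFF[0] + b * PAYOFF[1] + cont_value
    # Maximin over the defender's mixed strategies, on a coarse grid.
    best, best_p = -np.inf, None
    for p in np.linspace(0.0, 1.0, 101):
        row = np.array([p, 1 - p]) @ M          # defender's mixture
        v = row.min()                           # adversary best-responds
        if v > best:
            best, best_p = v, p
    return best, best_p

def nearest_value(b, values):
    """Non-parametric (nearest-neighbour) value approximation on the grid."""
    return values[np.abs(BELIEF_GRID - b).argmin()]

# Backward induction: solve the last stage first, then feed its values
# back as continuation values for earlier stages.
values = np.zeros_like(BELIEF_GRID)             # terminal values
policies = []
for t in reversed(range(HORIZON)):
    new_values = np.empty_like(values)
    stage_policy = {}
    for i, b in enumerate(BELIEF_GRID):
        v, p = stage_value(b, nearest_value(b, values))
        new_values[i] = v
        stage_policy[round(b, 2)] = p
    values = new_values
    policies.append(stage_policy)
policies.reverse()                              # policies[t] is stage t's policy

print(values)  # defender's value at each belief grid point at t = 0
```

Because a policy is computed at every (stage, belief) pair rather than only along on-path beliefs, the resulting strategy prescribes sensible play from any decision point onward — the property that motivates the equilibrium-refinement requirement in the abstract. TISP-PG replaces the exhaustive maximin loop with policy-gradient updates so the approach scales beyond tiny action spaces.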
Updated: 2021-08-24