Temporal Induced Self-Play for Stochastic Bayesian Games
arXiv - CS - Computer Science and Game Theory. Pub Date: 2021-08-21. DOI: arxiv-2108.09444. Weizhe Chen, Zihan Zhou, Yi Wu, Fei Fang
One practical requirement in solving dynamic games is to ensure that the
players play well from any decision point onward. To satisfy this requirement,
existing efforts focus on equilibrium refinement, but the scalability and
applicability of existing techniques are limited. In this paper, we propose
Temporal-Induced Self-Play (TISP), a novel reinforcement learning-based
framework to find strategies with decent performance from any decision point
onward. TISP uses belief-space representation, backward induction, policy
learning, and non-parametric approximation. Building upon TISP, we design a
policy-gradient-based algorithm TISP-PG. We prove that TISP-based algorithms
can find approximate Perfect Bayesian Equilibrium in zero-sum one-sided
stochastic Bayesian games with finite horizon. We test TISP-based algorithms in
various games, including finitely repeated security games and a grid-world
game. The results show that TISP-PG is more scalable than existing mathematical
programming-based methods and significantly outperforms other learning-based
methods.
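The abstract's core recipe — discretize the belief space, perform backward induction from the final stage, learn a policy per belief point, and use non-parametric approximation to evaluate continuation values off-grid — can be illustrated on a toy game. The sketch below is not the paper's TISP-PG algorithm (which uses policy-gradient reinforcement learning); it replaces policy learning with an exact maximin search over a toy zero-sum one-sided Bayesian game of our own invention, where the attacker's private type flips the sign of the defender's payoff matrix. All names, payoffs, and the nearest-neighbour scheme are illustrative assumptions.

```python
import numpy as np

# Toy zero-sum one-sided Bayesian game: the attacker has a private type
# in {0, 1}; the defender only holds a belief b = P(type = 1).
# Per-stage payoff matrices for the defender (row player), one per type.
# These matrices are illustrative, not taken from the paper.
PAYOFF = {
    0: np.array([[1.0, -1.0], [-1.0, 1.0]]),   # vs. type 0
    1: np.array([[-1.0, 1.0], [1.0, -1.0]]),   # vs. type 1 (sign-flipped)
}
HORIZON = 3
BELIEF_GRID = np.linspace(0.0, 1.0, 11)        # discretized belief space

def stage_value(b, cont_value):
    """Defender's maximin value of one stage at belief b, with a
    (simplified, belief-independent) continuation value added on."""
    # Expected payoff matrix under the current belief.
    M = (1 - b) * PAYOFF[0] + b * PAYOFF[1] + cont_value
    # Maximin over the defender's mixed strategies, on a coarse grid.
    best, best_p = -np.inf, None
    for p in np.linspace(0.0, 1.0, 101):
        row = np.array([p, 1 - p]) @ M          # defender's mixture
        v = row.min()                           # adversary best-responds
        if v > best:
            best, best_p = v, p
    return best, best_p

def nearest_value(b, values):
    """Non-parametric (nearest-neighbour) value approximation on the grid."""
    return values[np.abs(BELIEF_GRID - b).argmin()]

# Backward induction: solve the last stage first, then feed its values
# back as continuation values for earlier stages.
values = np.zeros_like(BELIEF_GRID)             # terminal values
policies = []
for t in reversed(range(HORIZON)):
    new_values = np.empty_like(values)
    stage_policy = {}
    for i, b in enumerate(BELIEF_GRID):
        v, p = stage_value(b, nearest_value(b, values))
        new_values[i] = v
        stage_policy[round(b, 2)] = p
    values = new_values
    policies.append(stage_policy)
policies.reverse()                              # policies[t] is stage t's policy

print(values)  # defender's value at each belief grid point at t = 0
```

Because a policy is computed at every (stage, belief) pair rather than only along on-path beliefs, the resulting strategy prescribes sensible play from any decision point onward — the property that motivates the equilibrium-refinement requirement in the abstract. TISP-PG replaces the exhaustive maximin loop with policy-gradient updates so the approach scales beyond tiny action spaces.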
Updated: 2021-08-24