Sample-Efficient Learning of Stackelberg Equilibria in General-Sum Games
arXiv - CS - Artificial Intelligence. Pub Date: 2021-02-23, DOI: arxiv-2102.11494
Yu Bai, Chi Jin, Huan Wang, Caiming Xiong

Real-world applications such as economics and policy making often involve solving multi-agent games with two unique features: (1) The agents are inherently asymmetric and partitioned into leaders and followers; (2) The agents have different reward functions, thus the game is general-sum. The majority of existing results in this field focuses on either symmetric solution concepts (e.g. Nash equilibrium) or zero-sum games. It remains vastly open how to learn the Stackelberg equilibrium -- an asymmetric analog of the Nash equilibrium -- in general-sum games efficiently from samples. This paper initiates the theoretical study of sample-efficient learning of the Stackelberg equilibrium in two-player turn-based general-sum games. We identify a fundamental gap between the exact value of the Stackelberg equilibrium and its estimated version using finite samples, which cannot be closed information-theoretically regardless of the algorithm. We then establish a positive result on sample-efficient learning of Stackelberg equilibrium with value optimal up to the gap identified above. We show that our sample complexity is tight with matching upper and lower bounds. Finally, we extend our learning results to the setting where the follower plays in a Markov Decision Process (MDP), and the setting where the leader and the follower act simultaneously.
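To make the solution concept concrete, below is a minimal Python sketch (not the paper's algorithm) for a two-player general-sum matrix game: it computes the exact Stackelberg equilibrium by enumerating leader commitments, and contrasts it with a naive plug-in estimate built from Bernoulli samples of the payoffs. A near-tie in the follower's rewards shows why finitely many samples cannot reliably identify the follower's true best response, which is the intuition behind the unavoidable gap mentioned in the abstract. The payoff matrices, the Bernoulli sampling model, and all function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative payoff matrices in [0, 1] (rows: leader actions a, columns: follower actions b).
# The follower's rewards at a = 0 are nearly tied (0.50 vs 0.51), the kind of instance
# where a finite number of samples cannot pin down the true best response.
u_leader = np.array([[0.2, 0.9],
                     [0.6, 0.3]])
u_follower = np.array([[0.50, 0.51],
                       [0.70, 0.20]])

def stackelberg_value(u_l, u_f):
    """Exact Stackelberg value with pure leader commitments: the leader commits to a,
    the follower best-responds to its own payoff row u_f[a], and ties are broken in
    the leader's favor (the optimistic convention)."""
    best_a, best_val = None, -np.inf
    for a in range(u_l.shape[0]):
        br = np.flatnonzero(u_f[a] == u_f[a].max())   # follower best responses to a
        val = u_l[a, br].max()                        # optimistic tie-breaking
        if val > best_val:
            best_a, best_val = a, val
    return best_a, best_val

def plug_in_stackelberg(u_l, u_f, n, rng):
    """Plug-in estimate: draw n Bernoulli samples of every payoff entry (bandit-style
    feedback), then solve the estimated game as if it were exact."""
    u_l_hat = rng.binomial(n, u_l) / n
    u_f_hat = rng.binomial(n, u_f) / n
    return stackelberg_value(u_l_hat, u_f_hat)

rng = np.random.default_rng(0)
print("exact:   ", stackelberg_value(u_leader, u_follower))
# With the near-tied follower row, the estimated best response at a = 0 can flip,
# so the plug-in leader may settle for action 1 (value ~0.6) instead of the true 0.9.
print("plug-in: ", plug_in_stackelberg(u_leader, u_follower, n=100, rng=rng))
```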

Updated: 2021-02-24