1 Introduction

Using game theory to examine multi-agent interactions in complex systems is a non-trivial task, especially when a payoff table or normal form representation is not directly available. Works by Walsh et al. [39, 40], Wellman et al. [43, 44], and Phelps et al. [23] have shown the great potential of using heuristic strategies and empirical game theory to examine such interactions at a higher strategic meta-level, instead of trying to capture the decision-making process at the level of the atomic actions involved. Doing so turns the interaction into a smaller normal form game, a heuristic or meta-game, in which the higher-level strategies are the primitive actions, making the complex multi-agent interaction amenable to game theoretic analysis.

Others have built on this empirical game theoretic methodology and applied these ideas to, for example, no-limit Texas hold'em Poker and various types of double auctions [16, 22, 23, 24, 30], showing that a game theoretic analysis at the level of meta-strategies yields novel insights into the type and form of interactions in complex systems.

Major limitations of this empirical game theoretic approach are that it comes without theoretical guarantees on how well an estimated meta-game, built from sampled data or simulations, approximates the true underlying meta-game (a model of the actual game or interaction), and that it is unclear how many data samples are required to achieve a good approximation. Additionally, when examining the evolutionary dynamics of these games the method remains limited to symmetric situations, in which the agents or players have access to the same set of strategies and are interchangeable. One approach is to ignore asymmetry (types of players) and average over many samples of types, resulting in a single expected payoff to each player in each entry of the meta-game payoff table. Many real-world situations, however, are asymmetric in nature and involve various roles for the participating agents: buyers and sellers in auctions, games such as Scotland Yard [21], different roles in e.g. robotic soccer (defender vs striker) [29], and even natural language (hearer vs speaker). As in the symmetric case, this type of analysis comes without strong guarantees on the approximation of the true underlying meta-game by an estimated meta-game based on sampled data, and it remains unclear how many data samples are required to achieve a good approximation.

In this paper we address these problems. We use the fact that a Nash equilibrium of the estimated game is a \(2 \epsilon \)-Nash equilibrium of the underlying meta-game, showing that we can closely approximate the real Nash equilibrium as long as we have enough data samples from which to build the meta-game payoff table. Furthermore, we also examine how many data samples are required to confidently approximate the underlying meta-game. We also show how to generalise the heuristic payoff or meta-game method introduced by Walsh et al. to two-population asymmetric games.

Finally, we illustrate the generalised method in several domains. We carry out an experimental illustration on the AlphaGo algorithm [27], Colonel Blotto [17], Capture the Flag (CTF) and an asymmetric Leduc poker game. In the AlphaGo experiments we show how a symmetric meta-game analysis can provide insights into the evolutionary dynamics and strengths of various versions of the AlphaGo algorithm while it was being developed, and how intransitive behaviour can arise when an unrelated strategy is introduced. In the Colonel Blotto game we illustrate how the methodology can provide insights into how humans play this game, constructing several symmetric meta-games from data collected on Facebook. In the CTF game we examine the dynamics of teams of two agents playing the Capture the Flag game, show examples of intransitive behaviours occurring between these advanced agents, and illustrate how the Elo rating [8] is incapable of capturing such intransitive behaviours. Finally, we illustrate the method in Leduc poker by examining an asymmetric meta-game generated by a recently introduced multiagent reinforcement learning algorithm, policy-space response oracles (PSRO) [18]. For this analysis we rely on some theoretical results that connect an asymmetric normal form game to its symmetric counterparts [32].

2 Related work

The purpose of the first applications of empirical game-theoretic analysis (EGTA) was to reduce the complexity of large economic problems in electronic commerce, such as continuous double auctions, supply chain management, market games, and automated trading [39, 44]. While these complex economic problems continue to be a primary application area of these methods [5, 37, 38, 41], the general technique has been applied in many different settings. These include the analysis of interactions among heuristic meta-strategies in poker [24], network protocol compliance [43], collision avoidance in robotics [11], and security games [20, 25, 48]. Research that followed on Walsh’s [39] initial work branched off in two directions: the first strand of work focused on strategic reasoning for simulation-based games [44], while the second strand focused on the evolutionary dynamical analysis of agent behavior inspired by evolutionary game theory [31, 33]. The initial paper of Walsh et al. contained innovative ideas that resulted in both research strands taking off in slightly different directions. The current paper is situated in the second line of work, focusing on the evolutionary dynamics of empirical or meta-games.

Evolutionary dynamics (foremost replicator dynamics) have often been presented as a practical tool for analyzing interactions among meta-strategies found in EGTA [2, 11, 39], and for studying the change in policies of multiple learning agents [3], as the EGTA approach is largely based on the same assumptions as evolutionary game theory, viz. repeated interactions among sub-groups sampled independently at random from an arbitrarily-large population of agents. Several approaches have also investigated the use of game-theoretic models, in combination with multi-agent learning, for understanding human learning in multi-agent systems, see e.g. [9, 26]. There have also been several uses of EGTA in the context of multiagent reinforcement learning. For example, reinforcement learning can be used to find a best response with a succinct policy representation [15], which in turn can be used to validate equilibria found in EGTA [47]; it has also been used as a regularization mechanism to learn more general meta-strategies than independent learners [18], and to determine the stability of non-adaptive trading strategies such as zero intelligence [49].

A major component of the EGTA paradigm is the estimation of a meta-game that acts as an approximation of the more complex underlying game (a sequential game, for example). The quality of the analyses and strategies derived from these estimates depends crucially on the quality of the approximation. The first preferential sampling scheme suggested using an information-theoretic value of information criterion to focus the Monte Carlo samples [40]. Other initial approaches to efficient estimation, mentioned in [44], used regression to generalize the payoff of several different complex strategy profiles [36]. Stochastic search methods, such as simulated annealing, were also proposed as a means to obtain Nash equilibrium approximations from simulation-based games [35]. More recent work also suggests player reductions that preserve deviations, with granular subsampling of the strategy space to get higher-quality information from a finite number of samples [46]. Finally, there is an online tool that helps with managing EGTA experiments [6], which employs a sampling procedure that prioritizes by the estimated regret of the corresponding strategies, which is known to approach the true regret of the underlying game [34]. Despite this, the authors of [6] state that, to the best of their knowledge, “the construction of optimal sequential sampling procedures for EGTA remains an open question”. This work addresses this question of sampling given current estimates and their errors.

3 Preliminaries

In this section, we introduce the necessary background for our game theoretic meta-game analysis of the repeated interaction between p players. For the sake of completeness we also provide, in the appendix of the paper (see “Appendix A”), some theoretical properties of heuristic payoff tables that have not been treated in the literature before; this material can easily be skipped, as the main results can be understood without it.

3.1 Normal form games

In a p-player Normal Form Game (NFG), players are involved in a single-round strategic interaction. Each player \(i\in [p] \doteq \{1,\dots ,p\}\) chooses a ‘strategy’ \(\pi ^i\in [k_i]\) from a set of \(k_i\) strategies and receives a payoff \(r^i(\pi ^1, \dots , \pi ^p)\in \mathbb {R}\). For the sake of simplicity, we will write \(\varvec{\pi }\) for the joint strategy \((\pi ^1,\dots ,\pi ^p)\in [k_1]\times \dots \times [k_p]\) and \(\varvec{r}(\varvec{\pi })\) for the joint reward \((r^1(\varvec{\pi }), \dots ,r^p(\varvec{\pi }))\). A p-player NFG is then the tuple \(G=(r^1, \dots , r^p)\). Players are also allowed to randomize, in which case player i chooses a probability distribution \(x^i\in \varDelta _{k_i-1} \doteq \{ x\in [0,1]^{k_i}\,:\, \sum _{j=1}^{k_i} x_j=1 \}\) over \([k_i]\) and the players receive the expected payoff under the joint strategy \(\varvec{x} = (x^1,\dots ,x^p)\). In particular, player i’s expected payoff is

$$\begin{aligned} E_{\varvec{\pi }\sim \varvec{x}}[r^i(\pi ^1,\dots ,\pi ^p)] \doteq \sum _{i_1=1}^{k_1} \dots \sum _{i_p=1}^{k_p} x^1_{i_1} \dots x^p_{i_p} r^i(i_1,\dots ,i_p)\,. \end{aligned}$$

A symmetric NFG captures interactions where payoffs depend on which strategies are played but not on who plays them. The first condition is therefore that the strategy sets are the same for all players (i.e. \(\forall i,j:\, k_i=k_j\), written k). The second condition is that if a permutation is applied to the joint strategy \(\varvec{\pi }\), the joint payoff is permuted accordingly. Formally, a game G is symmetric if for any permutation \(\sigma \) of [p] we have \(\varvec{r}(\varvec{\pi }_{\sigma }) = \varvec{r}_{\sigma }(\varvec{\pi }) \), where \(\varvec{\pi }_{\sigma } = (\pi ^{\sigma (1)}, \dots , \pi ^{\sigma (p)})\) and \(\varvec{r}_{\sigma }(\varvec{\pi }) = (r^{\sigma (1)}(\varvec{\pi }), \dots ,r^{\sigma (p)}(\varvec{\pi }))\). To repeat, for a game to be symmetric two conditions must hold: the players have access to the same strategy set and the payoff structure is symmetric, so that players are interchangeable. If either condition is violated the game is asymmetric.
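As a concrete illustration of this definition, the following minimal sketch (Python, with payoff tensors stored as NumPy arrays; the function name and the example matrices are our own, not part of the formal development) checks the permutation condition directly. For two players it reduces to the familiar test \(B = A^\top \).

```python
import numpy as np
from itertools import permutations, product

def is_symmetric_nfg(r, tol=1e-12):
    """Check r(pi_sigma) = r_sigma(pi) for every permutation sigma of the players.
    r[i] is player i's payoff tensor, all of shape (k, ..., k)."""
    p = len(r)
    k = r[0].shape[0]
    for sigma in permutations(range(p)):
        for pi in product(range(k), repeat=p):
            pi_sigma = tuple(pi[sigma[i]] for i in range(p))
            for i in range(p):
                if abs(r[i][pi_sigma] - r[sigma[i]][pi]) > tol:
                    return False
    return True

# two-player sanity check: the game (A, A^T) is symmetric
A = np.array([[3.0, 0.0], [5.0, 1.0]])
print(is_symmetric_nfg([A, A.T]))   # True
```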

In the asymmetric case our analysis will focus on the two-player case (two roles), so we introduce specific notation for the sake of simplicity. In a two-player normal-form game, each player’s payoff can be seen as a \(k_1 \times k_2\) matrix. We will write \(A = (a_{uv})_{u\in [k_1],v\in [k_2]}\) for the payoff matrix of player one (i.e. \(a_{uv} = r^1(u, v)\)) and \(B = (b_{uv})_{u\in [k_1],v\in [k_2]}\) for the payoff matrix of player two (i.e. \(b_{uv} = r^2(u, v)\)).

In summary, a two-player NFG is defined by the tuple \(G=(A, B)\).

3.2 Nash equilibrium

In a two-player game, a pair of strategies \((x,y)\in \varDelta _{k_1-1}\times \varDelta _{k_2-1}\) is a Nash equilibrium of the game (A, B) if no player has an incentive to switch from their current strategy. In other words, (x, y) is a Nash equilibrium if \(x^\top A y = \max Ay\) and \(x^\top B y = \max x^\top B\), where for a vector u (row- or column-vector), we define \(\max u = \max _i u_i\).

Evolutionary game theory often considers a single strategy x that plays against itself; the game is then said to have a single population, and such settings are referred to in the literature as single population games. In a single population game, x is a Nash equilibrium if \(x^\top A x = \max Ax\).

3.3 Replicator dynamics

Replicator Dynamics are one of the central concepts of Evolutionary Game Theory [10, 12, 19, 42, 50, 51]. They describe how a population of replicators, or a strategy profile, evolves in the midst of others through time under evolutionary pressure. Each replicator in the population is of a certain type, and replicators are randomly paired in interaction. Their reproductive success is determined by their fitness, which results from these interactions. The replicator dynamics express that the population share of a certain type will increase if the replicators of this type have a higher fitness than the population average; otherwise their population share will decrease. This evolutionary process is described by a first-order dynamical system. In a two-player NFG (A, B), the replicator equations are defined as follows:

$$\begin{aligned}&{\dot{x}}_u = x_u \left( (A y)_u - x^\top A y\right) \,,&{\dot{y}}_{v} = y_{v} \left( (x^\top B)_{v} - x^\top B y \right) \end{aligned}$$
(1)

with \(\mathbf{x }\in \varDelta _{k_1-1}\), \(\mathbf{y }\in \varDelta _{k_2-1}\). The dynamics defined by these two coupled differential equations change the strategy profile so as to increase the probability of the strategies that have the best return, i.e. are the fittest.

In the case of a symmetric two-player game (\(A=B^\top \)), the replicator equations assume that both players play the same strategy profile (i.e. player one and two play according to x) and the dynamics are defined as follows:

$$\begin{aligned}&{\dot{x}}_l = x_l \left( (A x)_l - x^\top A x\right) \end{aligned}$$
(2)
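To make the dynamics concrete, the following minimal sketch (Python/NumPy; the function names and the Euler step size are our own choices, not from the paper) integrates Eq. 1 for two populations and Eq. 2 for the single-population case.

```python
import numpy as np

def replicator_two_pop(A, B, x, y, dt=0.01, steps=10000):
    """Euler integration of the two-population replicator equations (Eq. 1).
    x, y are NumPy probability vectors; a small step size dt is assumed."""
    for _ in range(steps):
        avg1, avg2 = x @ A @ y, x @ B @ y
        x_new = x + dt * x * (A @ y - avg1)   # x_u ((A y)_u - x^T A y)
        y_new = y + dt * y * (x @ B - avg2)   # y_v ((x^T B)_v - x^T B y)
        x, y = x_new / x_new.sum(), y_new / y_new.sum()  # guard against drift
    return x, y

def replicator_single_pop(A, x, dt=0.01, steps=10000):
    """Euler integration of the single-population dynamics (Eq. 2), A = B^T."""
    for _ in range(steps):
        f = A @ x
        x = x + dt * x * (f - x @ A @ x)
        x = x / x.sum()
    return x
```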

3.4 Meta games and heuristic payoff tables

A meta game (or empirical game) is a simplified model of a complex multi-agent interaction. In order to analyze complex multi-agent systems like poker, we do not consider all possible atomic actions but rather a set of relevant meta-strategies that are often played [24]. These meta-strategies (sometimes styles of play), defined over atomic actions, are commonly adopted by players, such as “passive/aggressive” or “tight/loose” play in poker. A p-type meta game is then a p-player repeated NFG in which players play a limited number of meta-strategies. Following our poker example, the strategy set of the meta game would be \(\{\)“aggressive”, “tight”, “passive”\(\}\) and the reward function the outcome of a game between p players using different profiles.

When an NFG representation of such a complex multi-agent interaction is not available, one can use the heuristic payoff table (HPT), as introduced by Walsh et al. [39, 40]. The idea of the HPT is to capture the expected payoff of high-level meta-strategies through simulation, or from data of interactions, when the payoffs are not readily available (e.g. through a given NFG). Note that the purpose of the HPT is not to directly apply it to simple known matrix games; in that case one can just plug the normal form game directly into the replicator equations. Continuous-time replicator dynamics assume an infinite population, which is approximated in the HPT method by a finite population of p individuals so that simulations can be run. As such, the HPT is only an approximation: the larger p gets, the more subtleties are captured by the HPT and the more accurately the resulting dynamics reflect the true underlying dynamics.

If we were to construct a classical payoff table for \({\mathbf {r}}\) we would require \(k^p\) entries in the NFG table, which becomes large very quickly. Since all players can choose from the same strategy set and all players receive the same payoff for being in the same situation, we can simplify our payoff table. This means we consider a game where the payoff for playing a particular strategy depends only on the strategies employed by the other players, not on who is playing them. This corresponds to the setting of symmetric games.

We now introduce the HPT. Let N be a matrix, where each row \(N_i\) is a vector of counts \((n_1,\dots ,n_k)\) where \(\sum _j n_j=p\): \(n_j\) indicates how many of the p players play strategy j. The number of such distinct count vectors (which we also view as a discrete distribution) can be shown to be \(m=\left( {\begin{array}{c}p+k-1\\ p\end{array}}\right) \), which is the number of rows of N. Each distribution over strategies can be simulated (or derived from data), returning a vector of expected rewards \(u(N_i)\) (one for each of the k strategies). Let U be an \(m\times k\) matrix which captures the payoffs corresponding to the rows in N, i.e., \(U_i = u(N_i)\). We refer to an HPT as \(M = (N, U)\). Note that normalizing a count vector \((n_1,\dots ,n_k)\) by dividing it by p gives a probability vector \(\varvec{x}=(n_1/p,\dots ,n_k/p)\), which we call a discrete strategy distribution.
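A small sketch of how the rows of N can be enumerated (Python; the helper name count_vectors is ours) and how their number matches \(m=\left( {\begin{array}{c}p+k-1\\ p\end{array}}\right) \):

```python
from itertools import combinations_with_replacement
from math import comb

def count_vectors(p, k):
    """Enumerate all discrete distributions of p players over k strategies.
    Each row (n_1, ..., n_k) satisfies sum_j n_j = p."""
    rows = []
    for combo in combinations_with_replacement(range(k), p):
        n = [0] * k
        for j in combo:
            n[j] += 1
        rows.append(tuple(n))
    return rows

N = count_vectors(p=3, k=3)
assert len(N) == comb(3 + 3 - 1, 3)        # m = C(p+k-1, p) = 10 rows, as in Table 1
assert comb(6 + 3 - 1, 6) == 28            # the 6-player, 3-strategy example below
print(N[:3])                               # (3, 0, 0), (2, 1, 0), (2, 0, 1)
```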

Suppose we have a meta-game with 3 meta-strategies (\(k=3\)) and 6 players (\(p=6\)) interacting in a 6-type game; this leads to a meta game payoff table with 28 rows, a substantial reduction from the \(3^6\) cells of a classical payoff table. An important advantage of this type of table is that it easily extends to many agents, as opposed to the classical payoff matrix. Table 1 provides an example for three strategies and three players. The left-hand side shows the counts and gives the matrix N, while the right-hand side gives the payoffs for playing any of the strategies given the discrete profile and corresponds to matrix U.

Table 1 An example of a meta game payoff table

The HPT has one row per possible discrete distribution; for each row we usually run many simulations (or collect many data samples) to determine the expected payoff of each type present in the discrete distribution. There are p (finite) individuals present in the simulation at all times. In other words, we simulate populations of p agents and record their expected utilities in the HPT.

4 Method

There are now two possibilities: either the meta-game is symmetric, or it is asymmetric. We start with the simpler symmetric case, which has been studied in empirical game theory, and then continue with asymmetric games, in which we consider two populations, or roles.

4.1 Symmetric meta games

We consider a set of agents or players A with \(|A|=n\) that can choose a strategy from a set S with \(|S|=k\) and can participate in one or more p-type meta-games with \(p \le n\). If the game is symmetric, the formulation of meta-strategies has the advantage that the payoff for a strategy does not depend on which player has chosen it: the payoff only depends on the composition of strategies it is facing in the game, not on who is playing them. This symmetry has been the main focus of the use of empirical game theory analysis [22, 24, 39, 44].

In order to analyse the evolutionary dynamics of high-level meta-strategies, we also need to estimate the expected payoff of such strategies relative to each other. In evolutionary game theoretic terms, this is the relative fitness of the various strategies, dependent on the current frequencies of those strategies in the population.

In order to approximate the payoff for an arbitrary mix of strategies in an infinite population of replicators distributed over the species according to \({\mathbf {x}}\), p individuals are drawn randomly from the distribution \({\mathbf {x}}\). Let the set of all discrete profiles be denoted by \(\omega =\{(p,0,\dots,0),\dots,(0,\dots,0,p)\}\), let \(\mu _i=\{N\in \omega \,|\,N_i=0\}\) be the set of profiles in which strategy i is not played, and let \(\bar{\mu _i}=\{N\in \omega \,|\,N_i\ne 0\}\) be its complement. The probability of selecting a specific row \(N_i\) can be computed from \({\mathbf {x}}\) and \(N_i\) as follows (where \(\left( {\begin{array}{c}p\\ N_{i1}, N_{i2}, \ldots , N_{ik}\end{array}}\right) \) is a multinomial coefficient):

$$\begin{aligned} P(N_i | {\mathbf {x}}) = \left( {\begin{array}{c}p\\ N_{i1}, N_{i2}, \ldots , N_{ik}\end{array}}\right) \prod _{j=1}^{k} x_j^{N_{ij}}. \end{aligned}$$

The expected payoff of strategy \(\pi ^j\), \(r^j({\mathbf {x}})\), is then computed as the weighted combination of the payoffs given in all rows:

$$\begin{aligned} r^j({\mathbf {x}}) = \frac{\sum _{N_i\in \bar{\mu _j}} P(N_i | {\mathbf {x}}) U_{ij}}{1 - \sum _{N_i\in \mu _j}P(N_i | {\mathbf {x}})}. \end{aligned}$$

The denominator re-normalizes by ignoring rows of the meta-game payoff table that do not contribute to the payoff of strategy j, because that strategy is not present in the corresponding discrete distribution. This expected payoff function can now be used in Eq. 2 to compute the evolutionary population change according to the replicator dynamics, by replacing \((A x)_l\) with \(r^l({\mathbf {x}})\).
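The following sketch (Python/NumPy; the function names are ours, and the HPT arrays N and U are assumed to be given, e.g. built from simulation logs) implements the two formulas above and plugs the resulting fitness into the replicator update of Eq. 2.

```python
import numpy as np
from math import factorial, prod

def profile_prob(row, x):
    """P(N_i | x): multinomial probability of drawing count vector `row`
    when p individuals are sampled i.i.d. from the strategy distribution x."""
    p = sum(row)
    coef = factorial(p) // prod(factorial(n_j) for n_j in row)
    return coef * float(np.prod(np.asarray(x, dtype=float) ** np.asarray(row)))

def expected_payoffs(N, U, x):
    """r^j(x) for every strategy j, re-normalised over the rows in which
    strategy j is actually present (the set \\bar{mu}_j above)."""
    k = len(x)
    r = np.zeros(k)
    for j in range(k):
        num, mass = 0.0, 0.0
        for row, payoff in zip(N, U):
            prob = profile_prob(row, x)
            if row[j] > 0:
                num += prob * payoff[j]
                mass += prob
        r[j] = num / mass if mass > 0 else 0.0
    return r

def replicator_step(N, U, x, dt=0.01):
    """One Euler step of Eq. 2 with (Ax)_l replaced by r^l(x); x is a NumPy vector."""
    r = expected_payoffs(N, U, x)
    return x + dt * x * (r - float(np.dot(x, r)))
```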

If the HPT approach were applied to capture simple matrix games (which is not its purpose), one needs to take into account that in the single population replicator dynamics model, two individuals are randomly matched to play the normal form game. For an infinite population, sampling two individuals with or without replacement is identical. However, for finite populations, and especially if n is small, there is an important difference between sampling with and without replacement. The payoff calculation method shown above correctly reproduces the expected payoff of matrix games for both sampling schemes if n is sufficiently large; for smaller n, sampling with replacement results in lower errors. See “Appendix A” for an example and further theoretical properties of HPTs.

4.2 Asymmetric meta games

One can now wonder how the previously introduced method extends to asymmetric games, which has not been considered in the literature. An example of an asymmetric game is the famous Battle of the Sexes game illustrated in Table 2. In this game both players have the same strategy set, i.e., go to the opera or go to the movies; however, the corresponding payoffs differ, expressing the different preferences of the two players.

Table 2 Battle of the Sexes game: strategies O and M correspond to going to the Opera and going to the Movies respectively
Table 3 General \(3\times 3\) normal form game

If we aim to carry out a similar evolutionary analysis as in the symmetric case, restricting ourselves to two populations or roles, we will need two meta-game payoff tables, one for each player over its own strategy set. We will also need to use the asymmetric version of the replicator dynamics, as shown in Eq. 1. Additionally, in order to compute the right payoffs for every situation, we have to interpret a discrete strategy profile in the meta-table slightly differently. Suppose we have a 2-type meta game with three strategies in each player’s strategy set. We introduce a generalisation of our meta-table for both players by means of the example shown in Table 4, which corresponds to the general NFG shown in Table 3.

Table 4 An example of an asymmetric meta game payoff table

Let’s have a look at the first entry in Table 4, i.e., [(1, 1), 0, 0]. This entry means that both agents (i and j) are playing their first strategy, expressed by \(N_{i1,j1}\): the number of agents \(N_{i1}\) playing strategy \(\pi ^1_i\) in the first population equals 1, and the number of agents \(N_{j1}\) playing strategy \(\pi ^2_j\) in the second population also equals 1. The corresponding payoff pair \(U_{i1,j1}\) equals \((r_{11},c_{11})\). Now let’s have a look at the discrete profiles [(1, 0), (0, 1), 0] and [(0, 1), (1, 0), 0]. The first one means that the first player plays its first strategy while the second player plays its second strategy; the corresponding payoffs are \(r_{12}\) for the first player and \(c_{12}\) for the second player. The profile [(0, 1), (1, 0), 0] shows the reversed situation, in which the first player plays its second strategy and the second player plays its first strategy, yielding payoffs \(r_{21}\) and \(c_{21}\) for the first and second player respectively. In order to turn the table into a similar format as for the symmetric case, we can now introduce p meta-tables, one for each player. More precisely, we get Tables 5 and 6 for players 1 and 2 respectively.

Table 5 A decomposed asymmetric meta payoff table for Player 1
Table 6 A decomposed asymmetric meta payoff table for Player 2

One needs to take care in correctly interpreting these tables. Consider row [1, 1, 0], for instance. It should be interpreted in two ways: either the first player plays its first strategy while the other player plays its second strategy, and the first player receives a payoff of \(r_{12}\); or the first player plays its second strategy while the other player plays its first strategy, and the first player receives a payoff of \(r_{21}\). The expected payoff \(r^i({\mathbf {x}})\) can now be estimated in the same way as explained for the symmetric case, since we rely on symmetric replicator dynamics by decoupling asymmetric games into their symmetric counterparts (explained in the next section).

4.3 Linking symmetric and asymmetric games

Here we restate an important result on the link between an asymmetric game and its symmetric counterpart games; for a full treatment and discussion of these results see [32]. That work proves that if (x, y) is a Nash equilibrium of the bimatrix game (A, B) (where x and y have the same supportFootnote 1), then y is a Nash equilibrium of the single population, or symmetric, game A and x is a Nash equilibrium of the single population, or symmetric, game \(B^\top \). Both symmetric games are called the counterpart games of the asymmetric game (A, B). The reverse is also true: if y is a Nash equilibrium of the single population game A and x is a Nash equilibrium of the single population game \(B^\top \) (and if x and y have the same support), then (x, y) is a Nash equilibrium of the game (A, B). In our empirical analysis, we use this property to analyze an asymmetric game (A, B) by looking at the counterpart single population games A and \(B^\top \). Formally (from [32]):

Theorem 1

Strategies x and y, with the same support (i.e. \(I_x = I_y\)), constitute a Nash equilibrium of an asymmetric game \(G = (A, B)\) if and only if x is a Nash equilibrium of the single population game \(B^T\), y is a Nash equilibrium of the single population game A, and \(I_x = I_y\), where \(I_x = \{i \; | \; x_i>0\}\) and \(I_y = \{i \; | \; y_i>0\}\).

The result we state here is limited to strategies with the same support, but this condition can be softened (see [32]).
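A small sketch of how Theorem 1 can be used in practice (Python/NumPy; the Nash conditions are exactly those of Sect. 3.2, the function names are ours, and we assume a square bimatrix game as in the \(3\times 3\) PSRO example): candidate equilibria of the two single-population counterpart games with matching supports are recombined into equilibria of (A, B). The example uses one standard parameterization of Battle of the Sexes; the exact payoff values of Table 2 may differ.

```python
import numpy as np

def is_single_pop_nash(A, x, tol=1e-9):
    """x is a Nash equilibrium of the single-population game A iff
    x^T A x = max_i (A x)_i (Sect. 3.2)."""
    f = A @ x
    return float(x @ f) >= f.max() - tol

def is_bimatrix_nash(A, B, x, y, tol=1e-9):
    """(x, y) is a Nash equilibrium of (A, B) iff x^T A y = max(A y)
    and x^T B y = max(x^T B)."""
    return (float(x @ A @ y) >= (A @ y).max() - tol and
            float(x @ B @ y) >= (x @ B).max() - tol)

# Battle of the Sexes (payoff values assumed here); its mixed equilibrium
# is x = (3/5, 2/5), y = (2/5, 3/5), with identical supports.
A = np.array([[3.0, 0.0], [0.0, 2.0]])
B = np.array([[2.0, 0.0], [0.0, 3.0]])
x, y = np.array([0.6, 0.4]), np.array([0.4, 0.6])

same_support = np.array_equal(x > 1e-9, y > 1e-9)
print(same_support and is_single_pop_nash(B.T, x) and is_single_pop_nash(A, y))  # True
print(is_bimatrix_nash(A, B, x, y))                                              # True
```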

5 Theoretical insights

As illustrated in the previous section, the procedure for empirical meta-game analysis consists of two parts. First, one needs to construct an empirical meta-game utility function for each player; this step can be performed using logs of interactions between players, or by playing the game sufficiently often (simulations). Second, one expects that analyzing the estimated empirical game will give insights into the true underlying game itself (i.e. the game from which we sample). This section provides insights into the following questions: how much data is enough to generate a good approximation of the true underlying game, and is uniform sampling over actions or strategies the right method?

5.1 Main lemma

Sometimes players receive a stochastic reward \(R^i(\pi ^1, \dots , \pi ^p)\) for a given joint action \(\varvec{\pi }\). The underlying game we study is \(r^i(\pi ^1, \dots ,\pi ^p) = E \left[ R^i(\pi ^1, \dots ,\pi ^p) \right] \), and for the sake of simplicity the joint action of every player but player i will be written \(\varvec{\pi }^{-i}\). The next two definitions recall the concepts of a Nash equilibrium and an \(\epsilon \)-Nash equilibrium in p-player games.

Definition

A joint strategy \(\varvec{x} = (x^1, \dots ,x^p) = (x^{i}, \varvec{x}^{-i})\) is a Nash equilibrium if for all i:

$$\begin{aligned} E_{\varvec{\pi } \sim \varvec{x}} \left[ r^i(\varvec{\pi })\right] = \max _{\pi ^i} E_{\varvec{\pi }^{-i} \sim \varvec{x}^{-i}} \left[ r^i(\pi ^i, \varvec{\pi }^{-i})\right] \end{aligned}$$

Definition

A joint strategy \(\varvec{x} = (x^1, \dots ,x^p) = (x^{i}, \varvec{x}^{-i})\) is an \(\epsilon \)-Nash equilibrium if for all i:

$$\begin{aligned} \max _{\pi ^i} E_{\varvec{\pi }^{-i} \sim \varvec{x}^{-i}} \left[ r^i(\pi ^i, \varvec{\pi }^{-i})\right] - E_{\varvec{\pi } \sim \varvec{x}} \left[ r^i(\varvec{\pi })\right] \le \epsilon \end{aligned}$$
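For small games, the \(\epsilon \) in this definition (the largest gain any player can obtain by a unilateral deviation) can be computed by brute force. The sketch below (Python/NumPy; the function name is ours) assumes each payoff tensor \(r^i\) is given explicitly as a NumPy array.

```python
import numpy as np
from itertools import product

def nash_gap(payoffs, x):
    """Smallest eps such that the joint mixed strategy x is an eps-Nash
    equilibrium; payoffs[i] is player i's payoff tensor of shape (k_1,...,k_p)."""
    p = len(payoffs)
    joints = list(product(*[range(len(xi)) for xi in x]))
    gaps = []
    for i in range(p):
        r = payoffs[i]
        # expected payoff of the joint mixed strategy
        value = sum(np.prod([x[j][a[j]] for j in range(p)]) * r[a] for a in joints)
        # best pure deviation for player i against x^{-i}
        best_dev = max(
            sum(np.prod([x[j][a[j]] for j in range(p) if j != i]) * r[a]
                for a in joints if a[i] == ai)
            for ai in range(len(x[i])))
        gaps.append(best_dev - value)
    return max(gaps)

# matching pennies: the uniform profile is an exact Nash equilibrium (gap 0)
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(nash_gap([A, -A], [np.array([0.5, 0.5])] * 2))  # 0.0
```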

When running an analysis on a meta game, we do not have access to the average reward function \(r^i(\pi ^1, \dots ,\pi ^p)\) but only to an empirical estimate \(\hat{r}^i(\pi ^1, \dots ,\pi ^p)\). The following lemma shows that a Nash equilibrium of the empirical game \(\hat{r}^i(\pi ^1, \dots ,\pi ^p)\) is a \(2\epsilon \)-Nash equilibrium of the game \(r^i(\pi ^1, \dots ,\pi ^p)\), where \(\epsilon = \sup _{\varvec{\pi },i} |\hat{r}^i(\varvec{\pi })-r^i(\varvec{\pi })|\). A more general statement can be found in [34] (see comment below).

Lemma

If \(\varvec{x}\) is a Nash equilibrium for \(\hat{r}^i(\pi ^1, \dots ,\pi ^p)\), then it is a \(2\epsilon \)-Nash equilibrium for the game \(r^i(\pi ^1, \dots ,\pi ^p)\) where \(\epsilon = \sup _{\varvec{\pi },i} |r^i(\varvec{\pi })-\hat{r}^i(\varvec{\pi })|\).

Proof

Fix some player index \(1\le i\le p\). We extend the notation so that \(r^i(\pi ^i,\varvec{x}^{-i}) = E_{\varvec{\pi }^{-i}\sim \varvec{x}^{-i}} r^i(\pi ^i,\varvec{\pi }^{-i})\) and \(r^i(\varvec{x}) = E_{\varvec{\pi }\sim \varvec{x}} r^i(\varvec{\pi })\). We use similar notation with \({\hat{r}}^i\). Note that it follows from our conditions on \(r^i\) and \({\hat{r}}^i\) that for any \(\pi ^i\), \(r^i(\pi ^i,\varvec{x}^{-i})-\hat{r}^i(\pi ^i,\varvec{x}^{-i})\le \epsilon \) and \({\hat{r}}^i(\varvec{x})-r^i(\varvec{x})\le \epsilon \). Then,

$$\begin{aligned} \max _{\pi ^i} r^i(\pi ^i,\varvec{x}^{-i})&= \max _{\pi ^i} {\hat{r}}^i(\pi ^i,\varvec{x}^{-i}) + \underbrace{\max _{\pi ^i} r^i(\pi ^i,\varvec{x}^{-i}) - {\hat{r}}^i(\pi ^i,\varvec{x}^{-i})}_{\le \epsilon } \\&\le {\hat{r}}^i(\varvec{x}) + \epsilon \\&= r^i(\varvec{x}) + \underbrace{{\hat{r}}^i(\varvec{x})-r^i(\varvec{x})}_{\le \epsilon } + \epsilon \le r^i(\varvec{x}) + 2\epsilon \,. \end{aligned}$$

Since i was arbitrary, the result follows. \(\square \)

This lemma shows that if one can control the difference \(|r^i(\varvec{\pi })-\hat{r}^i(\varvec{\pi })|\) uniformly over players and joint strategies, then an equilibrium of the empirical game \(\hat{r}^i(\pi ^1, \dots ,\pi ^p)\) is almost an equilibrium of the game defined by the average reward function \(r^i(\pi ^1, \dots ,\pi ^p)\). It is worth mentioning that a related result [34] proves that the set of \(\epsilon \)-Nash equilibria of the empirical game includes the set of Nash equilibria of the underlying game. Note that [34] proves a general result that coincides with ours when \(\delta = 0\) and \(\epsilon (r) = 0\) in their Theorem 6.1.

5.2 Finite sample analysis

This section details some concentration results. In practice, we often have access to a batch of observations of the underlying game, and we run our analysis on an empirical estimate of the game denoted by \(\hat{r}^i(\varvec{\pi })\). The question is then either with what confidence we can say that a Nash equilibrium for \(\varvec{\hat{r}}\) is a \(2\epsilon \)-Nash equilibrium for \(\varvec{r}\), or, for a fixed confidence, for which \(\epsilon \) a Nash equilibrium for \(\varvec{\hat{r}}\) is a \(2\epsilon \)-Nash equilibrium for \(\varvec{r}\). In the case where we have access to game play, the question is how many samples n we need to assert, for a fixed confidence and a fixed \(\epsilon \), that a Nash equilibrium for \(\varvec{\hat{r}}\) is a \(2\epsilon \)-Nash equilibrium for \(\varvec{r}\). For the sake of simplicity, we will assume that the random payoffs are bounded in [0, 1].

5.2.1 The batch scenario

Here we assume that we are given \(n(i,\varvec{\pi })\) independent samples to compute the empirical average \(\hat{r}^i(\varvec{\pi })\). For i, \(\varvec{\pi }\) fixed, Hoeffding’s inequality gives that outside of some failure event \(\mathcal {E}_{i,\varvec{\pi },\delta }\) whose probability is bounded by \(\delta \), \(|\hat{r}^i(\varvec{\pi })-{r}^i(\varvec{\pi })| \le \sqrt{\log (2/\delta )/(2n(i,\varvec{\pi }))}\). It follows that outside of the event \(\mathcal {E}= \cup _{i,\varvec{\pi }\in S}\mathcal {E}_{i,\varvec{\pi },\delta /(|S|p)}\) with \(P(\mathcal {E})\le |S|p \frac{\delta }{|S|p}=\delta \),

$$\begin{aligned} \max _{i,\varvec{\pi }} |\hat{r}^i(\varvec{\pi })-{r}^i(\varvec{\pi })| \le \max _{i,\varvec{\pi }} \sqrt{\frac{\log (2 p|S|) + \log (1/\delta )}{2n(i,\varvec{\pi })}}\,. \end{aligned}$$

5.2.2 Uniform sampling

Setting \(n(i,\varvec{\pi })=n/(|S|p)\), the previous bound becomes

$$\begin{aligned} \max _{i,\varvec{\pi }} |\hat{r}^i(\varvec{\pi })-{r}^i(\varvec{\pi })| \le \sqrt{ |S| p \frac{\log (2 p|S|) + \log (1/\delta )}{2n}}\,. \end{aligned}$$

It follows, that to guarantee a uniform error of size \(\epsilon \) with probability \(1-\delta \), it is sufficient if

$$\begin{aligned} n \ge \frac{(\log (2 p|S|) + \log (1/\delta ))\, p |S| }{2\epsilon ^2}\,. \end{aligned}$$
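Both directions of this bound can be summarised in a short helper (Python; the names and argument conventions are ours, and rewards are assumed to be rescaled to [0, 1]): given the per-profile sample count it returns the guaranteed uniform error \(\epsilon \), and given a target \((\epsilon , \delta )\) it returns the total uniform-sampling budget above.

```python
from math import log, sqrt, ceil

def eps_bound(n_per_profile, p, num_profiles, delta):
    """Uniform error eps guaranteed with prob. 1 - delta when every
    (player, joint profile) entry is estimated from n_per_profile samples
    (Hoeffding + union bound, Sect. 5.2.1)."""
    return sqrt((log(2 * p * num_profiles) + log(1 / delta)) / (2 * n_per_profile))

def total_samples_uniform(eps, delta, p, num_profiles):
    """Total budget n, spread uniformly over players and profiles (Sect. 5.2.2),
    sufficient for a uniform error of at most eps with probability 1 - delta."""
    return ceil((log(2 * p * num_profiles) + log(1 / delta))
                * p * num_profiles / (2 * eps ** 2))
```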

Our bound is slightly better than the one of [34]. In our case, the probability that the equilibrium of the meta-game is worse than a \(2 \epsilon \)-Nash equilibrium is smaller than \(2 p |S|\exp (- 2\epsilon ^2 n)\) (where n is the number of samples used), whilst [34] shows an upper bound of \((K+1) p |S|\exp (- 2\epsilon ^2 n)\), where K is the maximum number of strategies available per player. In other words, our bound improves on the one from [34] by a multiplicative factor K. That work is the one most closely related to this paper, but other finite sample analyses have also been proposed in the literature in other contexts [14, 45].

6 Experiments

This section presents experiments that illustrate the meta-game approach and its feasibility for examining the strengths and weaknesses of higher-level strategies in various domains, including AlphaGo, Colonel Blotto, CTF and the meta-game generated by PSRO. Note that we restrict the meta-games to three strategies here, as this can be nicely visualised in a phase plot, and the restricted games still provide useful information about the dynamics in the full strategy spaces.

The first step of the analysis is always the derivation of the meta-game payoff table itself, which requires a sufficiently large data set to compute the relative payoffs of the various strategies under study against each other. We start with the AlphaGo data, continue with Colonel Blotto and CTF, and end with examining the asymmetric PSRO game. Also note that these tables are relatively small, as \(p=2\) and the number of strategies is three, so in these cases we could also use the NFG representation directly in the replicator dynamics equations. In fact, the dynamics of two-player zero-sum games give equivalent results for NFG and HPT representations (and as such both representations lead to the same visualizations); see the Appendix for more details and proofs. For consistency we therefore stick to HPT representations here.

6.1 AlphaGo

The data set under study consists of 7 AlphaGo variations and a number of different Go strategies such as Crazystone and Zen (previously the state of the art). \(\alpha \) denotes the AlphaGo algorithm, and the subscripts r, v and p indicate the use of rollouts, value networks and policy networks respectively (e.g. \(\alpha _{rvp}\) uses all three). For a detailed description of these strategies see [27]. The meta-game under study here concerns a 2-type NFG with \(|S|=9\); we will look at various 2-faces of the larger simplex. Table 9 in [27] summarises all wins and losses between these various strategies (which met several times), from which we can compute meta-game payoff tables.

6.1.1 Experiment 1: Strong strategies

This first experiment examines three of the strongest AlphaGo strategies in the data set, i.e., \(\alpha _{rvp}, \alpha _{vp}, \alpha _{rp}\). As a first step we created a meta-game payoff table involving these three strategies by looking at their pairwise interactions in the data set (summarised in Table 9 of [27]). This set contains data on how each strategy interacted with the other 8 strategies, listing the win rates that strategies achieved against one another (playing either as white or black) over several games. The meta-game payoff table derived for these three strategies is shown in Table 7.

Table 7 Meta-game payoff table generated from Table 9 in [27] for strategies \(\alpha _{rvp}, \alpha _{vp}, \alpha _{rp}\)

In Fig. 1 we have plotted the directional field of the meta-game payoff table using the replicator dynamics for a number of strategy profiles \({\mathbf {x}}\) in the simplex strategy space. From each of these points in strategy space, an arrow indicates the direction of flow, or change, of the population composition over the three strategies. Figure 2 shows a corresponding trajectory plot. From these plots one can easily observe that strategy \(\alpha _{rvp}\) is a strong attractor that consumes the entire strategy space over the three strategies. This rest point is also a Nash equilibrium. This result is in line with what we would expect from our knowledge of the strengths of these various learned policies. Still, the arrows indicate how the strategy landscape flows into this attractor and therefore provide useful information, as we will discuss later.

Fig. 1 Directional field plot for the 2-face consisting of strategies \(\alpha _{rvp}, \alpha _{vp}, \alpha _{rp}\)

Fig. 2 Trajectory plot for the 2-face consisting of strategies \(\alpha _{rvp}, \alpha _{vp}, \alpha _{rp}\)

6.1.2 Experiment 2: Evolution and transitivity of strengths

We start by investigating the 2-face simplex involving strategies \(\alpha _{rp}\), \(\alpha _{vp}\) and \(\alpha _{rv}\), for which we created a meta-game payoff table in the same way as in the previous experiment (not shown). The evolutionary dynamics of this 2-face can be observed in Fig. 4a. Clearly strategy \(\alpha _{rp}\) is a strong attractor and on average beats the two other strategies. We now replace this attractor by strategy \(\alpha _{rvp}\) and plot its evolutionary dynamics in Fig. 4b. What can be observed from both trajectory plots in Fig. 4 is that the curvature is less pronounced in plot 4b than in plot 4a. The reason is that the difference in strength between \(\alpha _{rv}\) and \(\alpha _{vp}\) is less apparent in the presence of an attractor that is even stronger than \(\alpha _{rp}\): \(\alpha _{rvp}\) pulls much more strongly on both \(\alpha _{rv}\) and \(\alpha _{vp}\), and consequently the flow goes more directly towards \(\alpha _{rvp}\). So even when a strategy space is dominated by one strategy, the curvature (or curl) is a promising measure for the strength of a meta-strategy.

Fig. 3 AlphaGo evolutionary dynamics plots for \(\alpha _v,\alpha _p,\alpha _r\), and \(\alpha _{rv}\)

What is worth observing from the AlphaGo dataset, illustrated as a series in Figs. 3 and 4, is that there is a clear incremental increase in the strength of the AlphaGo algorithm going from version \(\alpha _r\) to \(\alpha _{rvp}\), building on previous strengths, without any intransitive behaviour occurring, as long as we only consider the strategy space formed by the AlphaGo versions.

Finally, as discussed in Sect. 5, we can now examine how good an approximation the estimated game is. In the AlphaGo domain we only carry out this analysis for the games displayed in Fig. 4a, b, as it is similar for the other experiments. We know that \(\alpha _{rp}\) is a Nash equilibrium of the estimated game analyzed in Fig. 4a (meta-game table not shown). The outcome of \(\alpha _{rp}\) against \(\alpha _{rv}\) was estimated from \(n_{\alpha _{rp},\alpha _{rv}} = 63\) games (for the other pairs of strategies we have \(n_{\alpha _{vp},\alpha _{rp}} = 65\) and \(n_{\alpha _{vp},\alpha _{rv}} = 133\)). Because of the symmetry of the problem, the bound in Sect. 5.2.1 reduces to:

$$\begin{aligned} 1-\delta = 1-\exp (\log (2 p |S|) - 2\epsilon ^2 n_{\text {min}}) \end{aligned}$$

Therefore, we can conclude that the strategy \(\alpha _{rp}\) is a \(2\epsilon \)-Nash equilibrium (with \(\epsilon =0.15\)) of the real game with probability at least 0.29.

The same calculation would also give a confidence of 0.35 for the RD studied in Fig. 4b for an \(\epsilon = 0.15\) (as the number of samples are \((n_{\alpha _{rv},\alpha _{vp}},n_{\alpha _{vp},\alpha _{rvp}},n_{\alpha _{rvp},\alpha _{rv}}) = (65,106,91)\)). For these two cases, \(\epsilon = 0.05\) is too small to give any form of guarantee in probability. For example, the number of samples necessary per joint strategy to provide an accurate estimation for \(\epsilon = 0.05\) with a confidence of \(1-\delta = 0.95\) would be at least 1097 samples (819 samples for \((\epsilon , \delta ) = (0.05, 0.2)\) and 122 samples for \((\epsilon , \delta ) = (0.15, 0.05)\)).
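These numbers can be reproduced with the bound of Sect. 5.2.1 (a sketch in Python; we take \(p=2\) and \(|S|=3\) meta-strategies as in this sub-game, so the union-bound factor is \(2p|S|=12\), and the win rates are already in [0, 1]):

```python
from math import exp, log, ceil

p, S = 2, 3   # two players, three meta-strategies in the sub-game

def confidence(eps, n_min):
    """Lower bound on the probability that the estimated Nash is a 2*eps-Nash
    of the true game: 1 - 2*p*S*exp(-2*eps**2*n_min), with n_min the smallest
    pairwise sample count."""
    return 1 - 2 * p * S * exp(-2 * eps ** 2 * n_min)

def samples_per_joint_strategy(eps, delta):
    """Samples needed per joint strategy for error eps with confidence 1 - delta."""
    return ceil((log(2 * p * S) + log(1 / delta)) / (2 * eps ** 2))

print(confidence(0.15, 63))                    # ~0.295, i.e. "at least 0.29" (Fig. 4a)
print(confidence(0.15, 65))                    # ~0.356, the 0.35 reported for Fig. 4b
print(samples_per_joint_strategy(0.05, 0.05))  # 1097
print(samples_per_joint_strategy(0.05, 0.20))  # 819
print(samples_per_joint_strategy(0.15, 0.05))  # 122
```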

Fig. 4 AlphaGo evolutionary dynamics plots for \(\alpha _{rp},\alpha _{vp},\alpha _{rv}\), and \(\alpha _{rvp}\)

Fig. 5 Intransitive behaviour for \(\alpha _v\), \(\alpha _p\), and Zen

6.1.3 Experiment 3: Cyclic behaviour

A final experiment investigates what happens if we add a pre-AlphaGo state-of-the-art algorithm to the strategy space. We observed that even though \(\alpha _{rvp}\) remains the strongest strategy, beating all other AlphaGo versions and previous state-of-the-art algorithms, cyclic behaviour can occur, something that cannot be measured or seen from Elo ratings.Footnote 2 More precisely, we constructed a meta-game payoff table for strategies \(\alpha _v\), \(\alpha _p\) and Zen (one of the previous commercial state-of-the-art programs). In Fig. 5 we have plotted the evolutionary dynamics for this meta-game; as can be observed, there is a mixed equilibrium in strategy space around which the dynamics cycle, indicating that Zen is capable of introducing intransitivity: \(\alpha _v\) beats \(\alpha _p\), \(\alpha _p\) beats Zen, and Zen beats \(\alpha _v\).

6.2 Colonel Blotto

Colonel Blotto is a resource allocation game originally introduced by Borel [4]. Two players interact, each allocating m troops over n locations; they do this separately and without communication, after which both allocations are compared to determine the winner. A player wins a location if it has more troops there, and the player winning the most locations wins the game. This game has many game-theoretic intricacies; for an analysis see [17]. Kohli et al. have run Colonel Blotto on Facebook (project Waterloo), collecting data describing how humans play this game, with each player having \(m=100\) troops and \(n=5\) battlefields. The number of strategies in the game is vast: a game with m troops and n locations has \(\left( {\begin{array}{c}m + n - 1\\ n - 1\end{array}}\right) \) strategies.

Based on Kohli et al. we carry out a meta game analysis of the strongest strategies and the most frequently played strategies on Facebook (here the meta-game analysis is simply a restricted game). We have a look at several 3-strategy simplexes, which can be considered as 2-faces of the entire strategy space.

An instance of a strategy in the game of Blotto will be denoted as \([t_1,t_2,t_3,t_4,t_5]\) with \(\sum _i t_i=100\). All permutations \(\sigma _i\) of such a division of troops belong to the same strategy, and we assume that a player chooses uniformly among these permutations. Note that in this game there is no need to carry out the theoretical analysis of the approximation of the meta-game, as we are not examining heuristics or strategies over Blotto strategies, but rather these strategies themselves, for which the payoff against any other strategy will always be the same (by computation). Nevertheless, carrying out a meta-game analysis reveals interesting information.
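A sketch of this computation (Python; the ±1/0 win/draw/loss scoring and the function name are our assumptions, and the normalisation used in Table 9 may differ) averages the outcome over all pairs of permutations; the first line also reproduces the count of possible troop divisions.

```python
from itertools import permutations
from math import comb

print(comb(100 + 5 - 1, 5 - 1))   # 4598126 divisions of 100 troops over 5 fields

def blotto_outcome(a, b):
    """Expected outcome of troop division a against b when each side plays a
    uniformly random permutation of its division (+1 win, 0 draw, -1 loss)."""
    score, games = 0, 0
    for pa in set(permutations(a)):
        for pb in set(permutations(b)):
            won = sum(x > y for x, y in zip(pa, pb))
            lost = sum(x < y for x, y in zip(pa, pb))
            score += (won > lost) - (lost > won)
            games += 1
    return score / games

# two of the strong project Waterloo divisions from Table 8
print(blotto_outcome([36, 35, 24, 3, 2], [37, 37, 21, 3, 2]))
```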

6.2.1 Experiment 1: Top performing strategies

In this first experiment we examine the dynamics of the simplex consisting of the three best scoring strategies from the study of [17]: [36, 35, 24, 3, 2], [37, 37, 21, 3, 2], and [35, 35, 26, 2, 2], see Table 8. In a first step we compute a meta-game payoff table for these three strategies. The interactions are pairwise, and the expected payoff can be easily computed, assuming a uniform distribution for different permutations of a strategy. This normalised payoff is shown in Table 9.

Table 8 5 of the strongest strategies played on Facebook
Table 9 Meta-game payoff table generated for strategies \(s_1=[36,35,24,3,2]\), \(s_2=[37,37,21,3,2]\), and \(s_3=[35,35,26,2,2]\)

Using Table 9 we can compute evolutionary dynamics using the standard replicator equation. The resulting trajectory plot can be observed in Fig. 6a.

The first thing we see is that there is one strong attractor, i.e., strategy \(s_2=[37,37,21,3,2]\), and that the behaviour is transitive: [36, 35, 24, 3, 2] beats [35, 35, 26, 2, 2], [37, 37, 21, 3, 2] beats [36, 35, 24, 3, 2], and [37, 37, 21, 3, 2] beats \([35,35,26,2,2]\). Although [37, 37, 21, 3, 2] is the strongest strategy in this 3-strategy meta-game, the win rates (computed over all played strategies in project Waterloo) indicate that strategy [36, 35, 24, 3, 2] was more successful on Facebook. The differences are minimal, and on average it is better to choose [37, 37, 21, 3, 2], which was also the most frequently chosen strategy from the set of strong strategies, see Table 8. We show a similar plot for the evolutionary dynamics of strategies [35, 34, 25, 3, 3], [37, 37, 21, 3, 2], and [35, 35, 24, 3, 3] in Fig. 6b, which are three of the most frequently played strong strategies from Table 8.

Fig. 6 a Dynamics of [36, 35, 24, 3, 2], [37, 37, 21, 3, 2], and [35, 35, 26, 2, 2]. b Dynamics of [35, 34, 25, 3, 3], [37, 37, 21, 3, 2], and [35, 35, 24, 3, 3]

6.2.2 Experiment 2: Most frequently played strategies

We compared the evolutionary dynamics of the eight most frequently played strategies and present here a selection of some of the results. The meta-game under study in this domain concerns a 2-type repeated NFG G with \(|S|=8\). We will look at various 2-faces of the 8-simplex. The top eight most frequently played strategies are shown in Table 10.

Table 10 The 8 most frequently played strategies on Facebook

First we investigate the strategies [20, 20, 20, 20, 20], [1, 32, 33, 1, 33], and \([10,10,35,35,10]\) from our strategy set. In Table 11 we show the resulting meta-game payoff table of this 2-face simplex. Using this table we can again compute the replicator dynamics and investigate the trajectory plot in Fig. 7a. We observe that the dynamics cycle around a mixed Nash equilibrium (every interior rest point is a Nash equilibrium). This intransitive behaviour makes sense when looking at the pairwise interactions between the strategies and the corresponding payoffs they receive in Table 11. The expected payoff for [20, 20, 20, 20, 20] when playing against [1, 32, 33, 1, 33] is lower than the expected payoff for [1, 32, 33, 1, 33]. Similarly, [1, 32, 33, 1, 33] is beaten by [10, 10, 35, 35, 10] when they meet, and, completing the cycle, [10, 10, 35, 35, 10] receives a lower expected payoff against [20, 20, 20, 20, 20]. As such, the behaviour cycles around the Nash equilibrium.

Table 11 Meta-game payoff table generated for strategies \(s_1=[20,20,20,20,20]\), \(s_2=[1,32,33,1,33]\), and \(s_3=[10,10,35,35,10]\)
Fig. 7 Dynamics of 3 2-faces of the 8-simplex: a Nash eq. b Human play. c Another example of intransitive behaviour

An interesting question is where human players are situated in this cyclic landscape. In Fig. 7b we show the same trajectory plot, but add a red marker to indicate the strategy profile based on the frequencies with which these 3 strategies were played by human players. This is derived from Table 10, and the profile vector is (0.6, 0.25, 0.15). If we assume that the human agents optimise their behaviour in a survival-of-the-fittest fashion, they will cycle along the red trajectory. In Fig. 7c we illustrate similar intransitive behaviour for three other frequently played strategies.

6.3 Capture the flag

CTF is a game in which multiple players compete to capture each other’s flags. We specifically consider the implementation introduced by Jaderberg et al. [13], featuring two opposing teams of two players each, which compete to capture each other’s flags by strategically navigating, tagging, and evading opponents. Matches between two teams take place in a procedurally generated map, and the team with the greatest number of flag captures within five minutes is the winner. In Fig. 8 we present an example map in which the agents play, as well as an example first-person observation that the agents see.

Fig. 8 Capture the Flag: teams of two agents (shown as blue and red spheres) must pick up the opposing team’s flag and return it to their own base. a Shows the blue agent capturing the red team’s flag, and b is the blue agent’s first-person observation, where they can see their own flag in their base

The data set we study consists of the FTW strategy from [13] (which achieved human-level performance in CTF) at various points in its training, as well as a set of rule-based strategies across a number of skill levels. Teams are formed of two strategies, and for each team we have a number of matches against each other team from which we can summarise the wins and losses between strategies meeting several times. In the following experiments we restrict each team to two of the same strategy, and henceforth refer to teams by their strategy.

6.3.1 Experiment 1: Strategies throughout training

In our first experiment, we examine the FTW strategy from [13] at various points in its training, i.e. at training step 90e7, 130e7, and 170e7 (referred to as \(\alpha _{90}\), \(\alpha _{130}\), and \(\alpha _{170}\) respectively). To begin, we compute a meta-game payoff table for these strategies, with the normalised payoff shown in Table 12.

Table 12 Meta-game payoff table generated for strategies \(\alpha _{130}\), \(\alpha _{170}\), and \(\alpha _{90}\)

Using Table 12, we can compute evolutionary dynamics using the standard replicator equation, with the resulting trajectory plot presented in Fig. 9a. As can be seen, there is one strong attractor (i.e. \(\alpha _{170}\)), as well as transitive behaviour where \(\alpha _{170}\) beats \(\alpha _{130}\), \(\alpha _{170}\) beats \(\alpha _{90}\), and \(\alpha _{130}\) beats \(\alpha _{90}\).

We can now examine how good an approximation the estimated game is. From the RD studied in Fig. 9a, we know that \(\alpha _{170}\) is a Nash equilibrium of the estimated game. Using the formulation from Sect. 6.1.2, we can conclude that the strategy \(\alpha _{170}\) is a \(2\epsilon \)-Nash equilibrium (with \(\epsilon =0.15\)) of the real game with probability at least 0.99, where the sample counts are \((n_{\alpha _{90},\alpha _{130}},n_{\alpha _{90},\alpha _{170}},n_{\alpha _{130},\alpha _{170}}) = (164,174,157)\). Again, \(\epsilon =0.05\) is too small to provide guarantees with that few samples.

6.3.2 Experiment 2: Cyclic behavior

We now investigate what happens if we add a rule-based algorithm (referred to as Tauri) to the strategy space. To begin, we construct a meta-game payoff table for strategies \(\alpha _{130}\), \(\alpha _{170}\), and Tauri, as shown in Table 13. We can then plot the evolutionary dynamics for this meta-game. From the RD in Fig. 9b, we can see that there is a mixed equilibrium in the strategy space around which the dynamics cycle, indicating that Tauri is capable of introducing intransitivity: \(\alpha _{170}\) beats \(\alpha _{130}\), \(\alpha _{130}\) beats Tauri, and Tauri beats \(\alpha _{170}\).

As before, we can examine how good an approximation the estimated game is. With sample counts \((n_{\alpha _{130},\alpha _{170}},n_{\alpha _{170},\alpha _{Tauri}},n_{\alpha _{130},\alpha _{Tauri}}) = (157,80,96)\), we can conclude that the strategy \(\alpha _{170}\) is a \(2\epsilon \)-Nash equilibrium (with \(\epsilon =0.15\)) of the real game with probability at least 0.67 (we cannot guarantee anything with \(\epsilon =0.05\)).

Figure 12 in the supplemental material additionally illustrates the mixed strategy dynamics using Boltzmann Q-learning [33] dynamics for the three pairwise combinations of \((\alpha _{130},\alpha _{170},Tauri)\).

Table 13 Meta-game payoff table generated for strategies \(\alpha _{130}\), \(\alpha _{170}\), and Tauri
Fig. 9 Dynamics of 3 2-faces of the 8-simplex: a throughout training. b Intransitive behaviour by addition of Tauri. c Using Elo-calculated win probabilities

6.3.3 Experiment 3: Elo scores

In Experiment 2, we saw that while \(\alpha _{170}\) remains the strongest strategy even with the addition of Tauri, cyclic behaviour can now occur; this is behaviour that cannot be measured or seen from Elo ratings. To show this, we first calculate the Elo ratings of all strategies. We do this by iterating through the full data set of matches between the strategies several times, updating the Elo rating of each strategy based on the outcome of each match until the ratings converge.

Using this method, we find that the strategies \(\alpha _{130}\), \(\alpha _{170}\), and Tauri have Elo ratings of 1546, 1648, and 1520 respectively. We can then compute the probability of \(s_i\) beating \(s_j\) as \(p(s_i) = \frac{1}{1 + 10^m}\), where \(m = (R_j - R_i)/400\) is the rating difference between \(s_j\) and \(s_i\) divided by 400. We summarize the win rates of these three strategies in Table 14, and show the resulting evolutionary dynamics in Fig. 9c. As can be observed, there is no cyclic behaviour and \(\alpha _{170}\) is the Nash equilibrium of the estimated game. In other words, the cyclic behaviour of the three strategies cannot be seen from Elo ratings, since Elo assumes transitivity by design.

Table 14 Win rates calculated from Elo ratings for strategies \(\alpha _{130}\), \(\alpha _{170}\), and Tauri
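A small sketch of this Elo calculation (Python; the strategy labels in the dictionary are ours) makes the point explicit: because each strategy is reduced to a single scalar rating, the induced win probabilities are transitive by construction, so the cycle visible in Fig. 9b cannot appear in Fig. 9c.

```python
ratings = {"alpha_130": 1546, "alpha_170": 1648, "Tauri": 1520}

def elo_win_prob(r_i, r_j):
    """Probability that a player rated r_i beats a player rated r_j
    under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_j - r_i) / 400.0))

for i in ratings:
    for j in ratings:
        if i != j:
            print(f"P({i} beats {j}) = {elo_win_prob(ratings[i], ratings[j]):.3f}")
```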

6.4 PSRO-generated meta-game

We now turn our attention to an asymmetric game. Policy Space Response Oracles (PSRO) is a multiagent reinforcement learning process that reduces the strategy space of large extensive-form games via iterative best response computation. PSRO can be seen as a generalized form of fictitious play that produces approximate best responses, with arbitrary distributions over the generated responses computed by meta-strategy solvers. One application of PSRO is to a commonly-used benchmark problem known as Leduc poker [28], here with a fixed action space and penalties for taking illegal moves; PSRO therefore learned to play from scratch, without knowing which moves were legal. Leduc poker has a deck of 6 cards (jack, queen and king in two suits). Each player receives an initial private card and can bet a fixed amount of 2 chips in the first round and 4 chips in the second round, with a maximum of two raises in each round. A public card is revealed before the second round starts.

In Table 15 we present such an asymmetric \(3 \times 3\) 2-player game, generated by the first few epochs of PSRO learning to play Leduc poker. In the game illustrated here, each player has three strategies that, for ease of exposition, we call \(\{A, B, C\}\) for player 1 and \(\{D, E, F\}\) for player 2. Each of these strategies represents an approximate best response to a distribution over previous opponent strategies. In Table 16 we show the two symmetric counterpart games (see Sect. 4.3) of the empirical game produced by PSRO.

Table 15 Asymmetric PSRO meta game applied to Leduc poker
Table 16 Left: First counterpart game of the PSRO empirical game. Right: Second counterpart game of the PSRO empirical game

Again we can analyse the equilibrium landscape of this game, but now using the asymmetric meta-game payoff table and the decomposition result introduced in Sect. 4.3. Since the PSRO meta-game is asymmetric, we need two populations for the asymmetric replicator equations. Analysing and plotting the evolutionary asymmetric replicator dynamics quickly becomes tedious, as we deal with two simplices, one for each player. More precisely, if we consider a strategy profile for one player in its corresponding simplex, and that player adjusts its strategy, this immediately causes the second simplex to change, and vice versa. Consequently, it is no longer straightforward to analyse the dynamics.

In order to facilitate the analysis of the dynamics, we apply the counterpart theorems to remedy the problem. In Figs. 10 and 11 we show the evolutionary dynamics of the counterpart games. As can be observed in Fig. 10, the first counterpart game has only one equilibrium: a pure Nash equilibrium in which both players play strategy A, which absorbs the entire strategy space. Looking at Fig. 11, the situation is a bit more complex in the second counterpart game; here we observe three equilibria: one pure at strategy D, one pure at strategy F, and an unstable mixed equilibrium on the 1-face formed by strategies D and F. All of these are Nash equilibria in the respective counterpart games.Footnote 3 By applying the theory of Sect. 4.3 we know that only the combination ((1, 0, 0), (1, 0, 0)) remains as a pure Nash equilibrium of the asymmetric PSRO empirical game, since these strategies have the same support as a Nash equilibrium in the counterpart games. The other equilibria of the second counterpart game can be discarded as candidates for Nash equilibria of the PSRO empirical game, since they do not appear as equilibria for player 1.

Fig. 10 Trajectory plot of the first counterpart game

Fig. 11 Trajectory plot of the second counterpart game

Finally, each joint action of the game was estimated from 100 samples. As the outcome of the game is bounded in the interval \([-13,13]\), we can only guarantee with very low confidence that the Nash equilibrium of the meta-game we studied is a \(2\epsilon \)-Nash equilibrium of the unknown underlying game: with \(n=100\) and \(\epsilon =0.05\), the confidence can only be guaranteed to be above \(10^{-8}\). To guarantee a confidence of at least 0.95 for the same value of \(\epsilon =0.05\), we would need at least \(n=886 \times 10^3\) samples.

7 Conclusion

In this paper we have provided bounds for empirical game theoretic analysis using the heuristic payoff table method introduced by Walsh et al. [39], for both symmetric and 2-player asymmetric games. We call such games meta-games as they consider complex strategies instead of the atomic actions found in normal-form games. As such they are well suited to investigating real-world multi-agent interactions, since they summarise behaviour in terms of high-level strategies rather than primitive actions. We use the fact that a Nash equilibrium of the estimated meta-game is a \(2 \epsilon \)-Nash equilibrium of the true underlying game to provide theoretical bounds on how many data samples are required to build a reliable meta-game payoff table. Our method thus allows for an equilibrium analysis with a quantified confidence that the estimated game is a good approximation of the underlying meta-game. Finally, we have carried out an empirical illustration of this method in four complex domains, i.e., AlphaGo, Colonel Blotto, CTF and PSRO, showing the feasibility and strengths of the approach.