1 Introduction

Reinforcement Learning (RL) allows agents to learn how to map observations to actions through feedback reward signals (Sutton and Barto 1998). Recently, deep neural networks (LeCun et al. 2015; Schmidhuber 2015) have had a noticeable impact on RL (Li 2017). They provide flexible models for learning value functions and policies, help overcome difficulties related to large state spaces, and eliminate the need for hand-crafted features and ad-hoc heuristics (Cortes et al. 2002; Parker et al. 2003; Olfati-Saber et al. 2007). Deep reinforcement learning (DRL) algorithms, which usually rely on deep neural networks to approximate functions, have been successfully employed in single-agent systems, including video game playing (Mnih et al. 2015), robot locomotion (Lillicrap et al. 2015), object localisation (Caicedo and Lazebnik 2015) and data-center cooling (Evans and Gao 2016).

Following the uptake of DRL in single-agent domains, there is now a need to develop improved learning algorithms for multi-agent (MA) systems where additional challenges arise. Markov Decision Processes, upon which DRL methods rely, assume that the reward distribution and dynamics are stationary (Hernandez-Leal et al. 2017). When multiple learners interact with each other, this property is violated because the reward that an agent receives also depends on other agents’ actions (Laurent et al. 2011). This issue, known as the moving-target problem (Tuyls and Weiss 2012), removes convergence guarantees and introduces additional learning instabilities. Further difficulties arise from environments characterized by partial observability (Singh et al. 1994; Chu and Ye 2017; Peshkin et al. 2000) whereby the agents do not have full access to the world state, and where coordination skills are essential.

An important challenge in multi-agent DRL is how to facilitate communication amongst interacting agents. Communication is widely known to play a critical role in promoting coordination between humans (Számadó 2010). Humans have been proven to excel at communicating even in the absence of a conventional code (De Ruiter et al. 2010). When coordination is required and no common language exists, simple communication protocols are likely to emerge (Selten and Warglien 2007). Human communication involves more than sending and receiving messages: it requires specialized interactive intelligence in which receivers have the ability to recognize intentions and senders can properly design messages (Wharton 2003). The emergence of communication has been widely investigated (Garrod et al. 2010; Theisen et al. 2010); for example, new signs and symbols can emerge to represent real-world concepts. Fusaroli et al. (2012) demonstrated that language can be seen as a social coordination device learnt through reciprocal interaction with the environment for optimizing coordinative dynamics. The relation between communication and coordination has been widely discussed (Vorobeychik et al. 2017; Demichelis and Weibull 2008; Miller and Moser 2004; Kearns 2012). Communication is an essential skill in many tasks: for instance, emergency response organizations depend on it to manage critical and urgent situations properly (Comfort 2007). In multiplayer video games, reaching a sufficiently high level of coordination is often essential to succeed (Chen 2009).

Two-agent systems have often been studied when looking at the effects of communication on coordination. Galantucci (2005) showed that humans can easily produce new protocols to overcome the lack of a common language, through experiments in which pairs of participants playing video games were allowed to send messages through a medium that prevented the use of standard symbols. In two-player games, like the Battle of the Sexes, allowing players to exchange messages resulted in improved coordination (Cooper et al. 1989). Human conversations can be interpreted as a bi-directional form of communication, where the same entity can both send and receive messages (Lasswell 1948). This kind of communication can be efficiently explored in small-scale systems through coordination games (Cooper et al. 1992) and is often the key to success in real-world scenarios such as bargaining with incomplete information (Brosig et al. 2003).

Analogously, the importance of communication has been recognised when designing artificial MA learning systems, especially in tasks requiring synchronization (Scardovi and Sepulchre 2008; Wen et al. 2012). For example, in navigation tasks, agents can localise each other more easily through shared information (Fox et al. 2000). In group strategy coordination, such as automating negotiations, communication is fundamental to improve the final outcome (Wunder et al. 2009; Itō et al. 2011). A wide range of MA applications have benefitted from inter-agent message passing including distributed smart grid control (Pipattanasomporn et al. 2009), consensus in networks (You and Xie 2011), multi-robot control (Ren and Sorensen 2008), autonomous vehicle driving (Petrillo et al. 2018), elevator control (Crites and Barto 1998), soccer-playing robots (Stone and Veloso 1998) and language learning in two-agent systems (Lazaridou et al. 2016).

Recently, Lowe et al. (2017) have proposed MADDPG (Multi-Agent Deep Deterministic Policy Gradient). Their approach extends the actor-critic algorithm (Degris et al. 2012) in which each agent has an actor to select actions and a critic to evaluate them. MADDPG embraces the centralised learning and decentralised execution paradigm (CLDE) (Foerster et al. 2016; Kraemer and Banerjee 2016; Oliehoek and Vlassis 2007). During centralised training, the critics receive observations and actions from all the agents whilst the actors only see their local observations. On the other hand, the execution only relies on actors. This approach has been designed to address the emergence of environment non-stationarity (Tuyls and Weiss 2012; Laurent et al. 2011) and has been shown to perform well in a number of mixed competitive and cooperative environments. In MADDPG, the agents can only share each other’s actions and observations during training through their critics, but do not have the means to develop an explicit form of communication through their experiences. The input size of each critic increases linearly with the number of agents (Lowe et al. 2017), which hinders its scalability (Jiang and Lu 2018).

In this article, we consider tasks requiring strong coordination and synchronization skills. In order to thoroughly study the effects of communication on these scenarios, we focus on small-scale systems. This allows us to design tasks with an increasing level of complexity, and simplifies the investigation of possible correlations between the level of messages being exchanged and any environmental changes. We provide empirical evidence that the proposed method reaches very good performance on a range of two-agent scenarios when a high level of cooperation is required. In addition, we present experimental results for systems with up to six agents in the Supplementary Material (Sections B.2 and B.3). In the real world, there is a range of applications involving the coordination of only a few actors, for example motor interactions like sawing or cooperative lifting of heavy loads (Jarrassé et al. 2012).

In such cases, being able to communicate information beyond the private observations, and infer a shared representation of the world through interactions, becomes essential. Ideally, an agent should be able to remember its current and past experience generated when interacting with the environment, learn how to compactly represent these experiences in an appropriate manner, and share this information for others to benefit from. Analogously, an agent should be able to learn how to decode the information generated by other agents and leverage it under every environmental state. The above requirements are captured here by introducing a communication mechanism facilitating information sharing within the CLDE paradigm. Specifically, we provide the agents with a shared communication device that can be used to learn from their collective private observations and share relevant messages with others. Each agent also learns how to decode the memory content in order to improve its own policy. Both the read and write operations are implemented as parametrised, non-linear gating mechanisms that are learned concurrently with the individual policies. When the underlying task to be solved demands complex coordination skills, we demonstrate that our approach can achieve higher performance compared to the MADDPG baseline in small-scale systems. Furthermore, we demonstrate that being able to learn a communication protocol end-to-end, jointly with the policies, can also improve upon a meta-agent approach whereby all the agents perfectly share all their observations and actions in both training and execution. We investigate a potential interpretation of the communication patterns that have emerged when training two-agent systems through time-varying low-dimensional projections and their visual assessment, and demonstrate how these patterns correlate with the underlying tasks being learned.

This article is organised as follows. In Sect. 2 a general overview of related work is offered to characterize state-of-the-art approaches for MARL, with a special focus on communication systems. Section 3 contains the formalization of the problem setup, the details of the proposed method and the description of the learning process. All the experiments are reported in Sect. 4, where results are presented in terms of numerical metrics to evaluate the performance achieved on six different scenarios; an analysis of the communication channel is also provided to offer qualitative insights into the exchanged messages. Concluding comments are given in Sect. 5. In the Supplementary Material, Section A describes details of MA-MADDPG, a comparative method, and Section B presents a range of additional experiments to further investigate: the effects of memory corruption; changes in performance when increasing the number of agents; an ablation study to validate the components used in the proposed method; box plots with the main results; an assessment of the robustness of the method when changing the random seeds; and additional analyses of the communication channel.

2 Related work

The problem of reinforcement learning in cooperative environments has been studied extensively (Littman 1994; Schmidhuber 1996; Panait and Luke 2005; Matignon et al. 2007). Early attempts exploited single-agent techniques like Q-learning to train all agents independently (Tan 1993), but suffered from the excessive size of the state space resulting from having multiple agents. Subsequent improvements were obtained using variations of Q-learning (Ono and Fukumoto 1996; Guestrin et al. 2002) and distributed approaches (Lauer and Riedmiller 2000). More recently, DRL techniques like DQN (Mnih et al. 2013) have led to superior performance in single-agent settings by approximating policies through deep neural networks. Tampuu et al. (2017) have demonstrated that an extension of the DQN is able to train multiple agents independently to solve a popular two-agent system, the Pong game. Gupta et al. (2017) have analyzed the performance of popular DRL algorithms, including DQN, DDPG (Lillicrap et al. 2015), TRPO (Schulman et al. 2015) and actor-critic, on different MA environments, and have introduced a curriculum learning approach to increase scalability. Foerster et al. (2017) have suggested using a centralised critic for all agents that marginalises out a single agent's action while the other agents' actions are kept fixed. Iqbal and Sha (2019) proposed MAAC (Multi-Actor-Attention-Critic), a framework for learning decentralised policies with centralised critics, which selects relevant information for each agent at every time-step through an attention mechanism. In more recent work, a probabilistic recursive reasoning framework has been proposed to model behaviours in a two-agent context: each agent, through variational Bayes methods, approximates the other agent's policy to predict its strategy and then improve its own policy (Wen et al. 2019).

The role of communication in cooperative settings has also been explored, and different methods have been proposed differing in how the communication channels have been formulated using DRL techniques. Many approaches rely on implicit communication mechanisms whereby the weights of the neural networks used to implement policies or action-value functions are shared across agents or modelled to allow inter-agent information flow. For instance, in CommNet (Sukhbaatar et al. 2016), the policies are implemented through subsets of units of a large feed-forward neural network mapping the inputs of all agents to actions. At any given time step, the hidden states of each agent are used as messages, averaged and sent as input for the next layer. Singh et al. (2019) proposed IC3Net, a model designed to improve CommNet, where the hidden states of the agents are also used as messages, but this time they are averaged only after being weighted by a gating mechanism. In addition, in IC3Net, each agent is implemented through a Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber 1997) in order to consider the history of the seen observations. In BiCNet (Peng et al. 2017), the agents’ policies and value networks are connected through bidirectional neural networks, and trained using an actor-critic approach. Jiang and Lu (2018) proposed an attention mechanism that, when a need for communication emerges, selects which subsets of agents should communicate; the hidden states of their policy networks are integrated through an LSTM to generate a message that is used as input for the next layer of the policy network. Das et al. (2018) utilised a soft attention mechanism to allow the agents to select the recipients of their messages. Each agent, along with the message, broadcasts a signature which can be used to encode agent-specific information. Kong et al. (2017) introduced a master-slave architecture whereby a master agent provides high-level instructions to organize the slave agents in an attempt to achieve fine-grained optimality. Similarly, in Feudal Multiagent Hierarchies (Ahilan and Dayan 2019), an agent acts as “manager” and learns to communicate sub-goals to multiple workers operating simultaneously. A different approach is instead provided by the Bayesian Action Decoder (BAD) (Foerster et al. 2018), a technique for two-agent settings where an approximate Bayesian update is used to produce a public belief that directly conditions the actions of all agents.

In our work, we introduce a mechanism to generate explicit messages capturing relevant aspects of the world, which the agents are able to collectively learn using their observations and interactions. The messages are then sent and received to complement their private observations when making decisions. Some aspects of our approach are related to DIAL (Differentiable Inter-Agent Learning) (Foerster et al. 2016), in that the communication is enabled by differentiable channels that allow the gradient of the Q-function to provide appropriate feedback in small-scale scenarios. Like DIAL, we would like the agents to share explicit messages. However, whereas DIAL uses simple and pre-determined protocols, our agents are given the ability to infer complex protocols from experience, without necessarily relying on pre-defined ones, and utilise those to learn better policies. Explicit messages are also used in SchedNet (Kim et al. 2019) to investigate situations where the bandwidth is limited and only some of the agents are allowed to communicate. In their approach, which also focuses on small-scale scenarios to better capture the scheduling constraints, the agents produce messages by encoding their observations and a scheduler decides whether an agent is allowed to use a communication channel. A limited bandwidth channel is also used in our work, but all the agents have full access to the channel.

3 Memory-driven MADDPG

3.1 Problem setup

We consider a system with N interacting agents, where N is typically small, and adopt a multi-agent extension of partially observable Markov decision processes (Littman 1994). This formulation assumes a set, \( {\mathcal {S}}\), containing all the states characterising the environment; a sequence \(\{{\mathcal {A}}_1, {\mathcal {A}}_2, \dots , {\mathcal {A}}_N\}\) where each \({\mathcal {A}}_i\) is a set of possible actions for the \(i^{th}\) agent; a sequence \(\{{\mathcal {O}}_1, {\mathcal {O}}_2, \dots , {\mathcal {O}}_N\}\) where each \( {\mathcal {O}}_i\) contains the observations available to the \(i^{th}\) agent. Each \(\varvec{o}_i \in {\mathcal {O}}_i\) provides a partial characterisation of the current state and is private to that agent. Every action \(a_i \in {\mathcal {A}}_i\) is deterministically chosen according to a policy function, \( \varvec{\mu }_{\theta _i}: {\mathcal {O}}_i \mapsto {\mathcal {A}}_i \), parametrised by \(\theta _i\). The environment generates a next state according to a transition function, \( {\mathcal {T}}: {\mathcal {S}} \times {\mathcal {A}}_1 \times {\mathcal {A}}_2 \times \dots \times {\mathcal {A}}_N \mapsto {\mathcal {S}} \), that considers the current state and the N actions taken by the agents. The reward received by an agent, \( r_i : {\mathcal {S}} \times {\mathcal {A}}_1 \times {\mathcal {A}}_2 \times \dots \times {\mathcal {A}}_N \mapsto {\mathbb {R}}\), is a function of states and actions. Each agent learns a policy that maximises the expected discounted future rewards over a period of T time steps, \(J(\theta _i) = {\mathbb {E}} [R_i]\), where \(R_i = \sum _{t=0}^{T} \gamma ^t r_i(s^t_i,a^t_i)\) is the \(\gamma \)-discounted sum of future rewards. During training, we would like an agent to learn by using not only its own observations, but also a collectively learned representation of the world that accumulates through experiences coming from all the agents. At the same time, each agent should develop the ability to interpret this shared knowledge in its own unique way as needed to optimise its policy. Finally, the information sharing mechanism needs to be designed in such a way that it can be used in both training and execution.

3.2 Memory-driven communication

We introduce a shared communication mechanism enabling agents to establish a communication protocol through a memory device \( {\mathcal {M}} \) of pre-determined capacity M (Fig. 1). The device is designed to store a message \( {\mathbf {m}} \in {\mathbb {R}}^M \) which progressively captures the collective knowledge of the agents as they interact. An agent’s policy becomes \( \varvec{\mu }_{\theta _i}: {\mathcal {O}}_i \times {\mathcal {M}} \mapsto A_i \), i.e. it is dependent on the agent’s private observation as well as the collective memory. Before taking an action, each agent accesses the memory device to initially retrieve and interpret the message left by others. After reading the message, the agent performs a writing operation that updates the memory content. During training, these operations are learned without any a priori constraint on the nature of the messages other than the device’s size, M. During execution, the agents use the communication protocol that they have learned to read and write the memory over an entire episode. We aim to build a model trainable end-to-end only through reward signals, and use neural networks as function approximators for policies, and learnable gated functions as mechanisms to facilitate an agent’s interactions with the memory. The chosen parametrisations of these operations are presented and discussed below.

Fig. 1

The MD-MADDPG framework. During training and testing, each policy uses its observation and the content of the shared memory to produce a new action and then update the shared channel. Critics are used during training only and each one of them takes as input all the observations and actions

Encoding operation Upon receiving its private observations, each agent maps them on to an embedding representing the agent’s current vision of the state:

$$\begin{aligned} {\mathbf {e}}_i = \varphi _{\theta _{i}^e}^{enc}(\varvec{o}_i), {\mathbf {e}}_i \in {\mathbb {R}}^{E} \end{aligned}$$
(1)

where \( \varphi _{\theta _{i}^e}^{enc} \) is a neural network parametrised by \( \theta _{i}^e \). The embedding \( {\mathbf {e}}_i \) plays a fundamental role in selecting a new action and in the reading and writing phases.
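As an illustration, a minimal PyTorch sketch of this encoding step could look as follows, assuming a single hidden layer of 512 units and an embedding size \(E = 200\) as reported in Sect. 4.2; all names are illustrative and not taken from a released implementation.

```python
import torch
import torch.nn as nn

class ObservationEncoder(nn.Module):
    """Sketch of the encoder in Eq. 1: maps a private observation o_i to e_i."""

    def __init__(self, obs_dim: int, embed_dim: int = 200, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, obs_dim) -> e_i: (batch, embed_dim)
        return self.net(obs)
```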

Read operation After encoding the current information, the agent performs a read operation allowing it to extract and interpret relevant knowledge that has been previously captured in \({\mathcal {M}}\). By interpreting this information content, the agent has access to what other agents have learned. A context vector \({\mathbf {h}}_i\) is generated to capture spatio-temporal information previously encoded in \({\mathbf {e}}_i\) through a linear mapping,

$$\begin{aligned} {\mathbf {h}}_i = {\mathbf {W}}_i^{h}{\mathbf {e}}_i, {\mathbf {h}}_i \in {\mathbb {R}}^{H}, {\mathbf {W}}_i^{h} \in {\mathbb {R}}^{H \times E} \end{aligned}$$

where \( {\mathbf {W}}_i^{h} \) represents the learnable weights of the linear projection. While \({\mathbf {e}}_i\) is defined as a general observation embedding, \({\mathbf {h}}_i\) is specifically designed to extract features for the reading operation. The context vector \({\mathbf {h}}_i\) can be interpreted as an agent’s internal representation that uses the observation embedding \({\mathbf {e}}_i\) to extract information to be utilized by the gating mechanism only (Eq. 2); the output of this gate is then used to extract information from the memory. The main function of the context vector is to facilitate the emergence of an internal representation specifically designed for interpreting the memory content during the read phase. An ablation study aimed at investigating the added benefits introduced by \({\mathbf {h}}_i\) is provided in the Supplementary Material (B.4). This study supports our intuition that the context vector is crucial for the proper functioning of the entire framework on more complex environments. The agent observation embedding \({\mathbf {e}}_i\), the reading context vector \({\mathbf {h}}_i\) and the current memory \({\mathbf {m}}\) contain different types of information that are used jointly as inputs to learn a gating mechanism,

$$\begin{aligned} {\mathbf {k}}_i = \sigma ({\mathbf {W}}_i^{k}[{\mathbf {e}}_i,{\mathbf {h}}_i, {\mathbf {m}}]), {\mathbf {k}}_i \in [0,1]^{M},{\mathbf {W}}_i^{k} \in {\mathbb {R}}^{M \times (E + H + M)} \end{aligned}$$
(2)

where \( \sigma ( \cdot ) \) is the sigmoid function and \([{\mathbf {e}}_i,{\mathbf {h}}_i, {\mathbf {m}}]\) means that the three vectors are concatenated. The values of \( {\mathbf {k}}_i \) are used as weights to modulate the memory content and extract the information from it, i.e.

$$\begin{aligned} {\mathbf {r}}_i ={\mathbf {m}} \odot {\mathbf {k}}_i \end{aligned}$$
(3)

where \( \odot \) represents the Hadamard product. \({\mathbf {k}}_i\) takes values in [0, 1] and its role is to potentially downgrade the information stored in memory or even completely discard the current content. Learning agent-specific weights \({\mathbf {W}}_i^h\) and \({\mathbf {W}}_i^k \) means that each agent is able to interpret \({\mathbf {m}}\) in its own unique way. As the reading operation strongly depends on the current observation, the interpretation of \({\mathbf {m}}\) can change from time to time depending on what an agent sees during an episode. Given that \({\mathbf {r}}_i\) depends on \({\mathbf {m}}\) and \({\mathbf {e}}_i\) (from \(\varvec{o}_i\) in Eq. 1), we lump all the adjustable parameters into \(\theta _{i}^{\zeta } = \{ {\mathbf {W}}_i^h, {\mathbf {W}}_i^k \} \) and write

$$\begin{aligned} {\mathbf {r}}_i = \zeta _{\theta _{i}^{\zeta }}(\varvec{o}_i, {\mathbf {m}}). \end{aligned}$$
(4)
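A minimal sketch of the read operation (Eqs. 2–4) is given below, assuming \(E = H = M = 200\) as in Sect. 4.2; biases are omitted so that the learnable parameters mirror \({\mathbf {W}}_i^h\) and \({\mathbf {W}}_i^k\) in the text.

```python
import torch
import torch.nn as nn

class MemoryReader(nn.Module):
    """Sketch of the read gate: r_i = m * sigmoid(W_k [e_i, h_i, m])."""

    def __init__(self, embed_dim: int = 200, context_dim: int = 200, mem_dim: int = 200):
        super().__init__()
        self.W_h = nn.Linear(embed_dim, context_dim, bias=False)                     # h_i = W_i^h e_i
        self.W_k = nn.Linear(embed_dim + context_dim + mem_dim, mem_dim, bias=False) # gate over [e_i, h_i, m]

    def forward(self, e: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        h = self.W_h(e)                                            # context vector
        k = torch.sigmoid(self.W_k(torch.cat([e, h, m], dim=-1)))  # Eq. 2
        return m * k                                               # Eq. 3
```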

Write operation In the writing phase, an agent decides what information to share and how to properly update the content of the memory whilst taking into account the other agents. The write operation is loosely inspired by the LSTM (Hochreiter and Schmidhuber 1997) where the content of the memory is updated through gated functions regulating what information is kept and what is discarded. Initially, the agent generates a candidate memory content, \({\mathbf {c}}_i\), which depends on its own encoded observations and current shared memory through a non-linear mapping,

$$\begin{aligned} {\mathbf {c}}_i = \tanh ({\mathbf {W}}_i^{c}[{\mathbf {e}}_i,{\mathbf {m}}]), {\mathbf {c}}_i \in [-1,1]^{M},{\mathbf {W}}_i^{c} \in {\mathbb {R}}^{M \times (E + M)} \end{aligned}$$

where \({\mathbf {W}}_i^{c}\) are weights to learn. An input gate, \({\mathbf {g}}_i\), contains the values used to regulate the content of this candidate while a forget gate, \({\mathbf {f}}_i\), is used to decide what to keep and what to discard from the current \( {\mathbf {m}} \). These operations are described as follows:

$$\begin{aligned} \begin{aligned} {\mathbf {g}}_i = {}&\sigma ({\mathbf {W}}_i^{g}[{\mathbf {e}}_i,{\mathbf {m}}]), {\mathbf {g}}_i \in [0,1]^{M},{\mathbf {W}}_i^{g} \in {\mathbb {R}}^{M \times (E + M)}\\ {\mathbf {f}}_i = {}&\sigma ({\mathbf {W}}_i^{f}[{\mathbf {e}}_i,{\mathbf {m}}]), {\mathbf {f}}_i \in [0,1]^{M},{\mathbf {W}}_i^{f} \in {\mathbb {R}}^{M \times (E + M)}. \end{aligned} \end{aligned}$$

The \(i^{th}\) agent then finally generates an updated message as a weighted linear combination of old and new messages, as follows:

$$\begin{aligned} \mathbf {m'} = {\mathbf {g}}_i \odot {\mathbf {c}}_i + {\mathbf {f}}_i \odot {\mathbf {m}}. \end{aligned}$$
(5)

The update \({\mathbf {m}}'\) is stored in memory \({\mathcal {M}}\) and made accessible to the other agents. At each time step, the agents sequentially read and write the content of the memory using the above procedure. Since \(\mathbf {m'}\) depends on \({\mathbf {m}}\) and \({\mathbf {e}}_i\) (derived from \(\varvec{o}_i\) in Eq. 1), we collect all the parameters into \(\theta _{i}^{\xi } = \{ {\mathbf {W}}_i^c, {\mathbf {W}}_i^g, {\mathbf {W}}_i^f \}\) and express the write operation as:

$$\begin{aligned} \mathbf {m'} = \xi _{\theta _{i}^{\xi }}(\varvec{o}_i, {\mathbf {m}}). \end{aligned}$$
(6)
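Analogously, the write operation (Eq. 5) can be sketched as a small gated module; as above, \(E = M = 200\) is assumed and biases are omitted for parity with the equations.

```python
import torch
import torch.nn as nn

class MemoryWriter(nn.Module):
    """Sketch of the write gates: m' = g_i * c_i + f_i * m."""

    def __init__(self, embed_dim: int = 200, mem_dim: int = 200):
        super().__init__()
        self.W_c = nn.Linear(embed_dim + mem_dim, mem_dim, bias=False)  # candidate content c_i
        self.W_g = nn.Linear(embed_dim + mem_dim, mem_dim, bias=False)  # input gate g_i
        self.W_f = nn.Linear(embed_dim + mem_dim, mem_dim, bias=False)  # forget gate f_i

    def forward(self, e: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        x = torch.cat([e, m], dim=-1)
        c = torch.tanh(self.W_c(x))      # what could be written
        g = torch.sigmoid(self.W_g(x))   # how much of the candidate to write
        f = torch.sigmoid(self.W_f(x))   # how much of the old memory to keep
        return g * c + f * m             # Eq. 5
```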

Action selector Upon completing both read and write operations, the agent is able to take an action, \(a_i\), which depends on the current encoding of its observations, its own interpretation of the current memory content and its updated version, that is

$$\begin{aligned} a_i = \varphi _{\theta _{i}^a}^{act}({\mathbf {e}}_i,{\mathbf {r}}_i,\mathbf {m'}) \end{aligned}$$
(7)

where \( \varphi _{\theta _{i}^a}^{act} \) is a neural network parametrised by \({\theta _{i}^a}\). The resulting policy function can be written as a composition of functions:

$$\begin{aligned} \varvec{\mu }_{\theta _i}(\varvec{o}_i, {\mathbf {m}}) = \varphi _{\theta _{i}^a}^{act}(\varphi _{\theta _{i}^e}^{enc}(\varvec{o}_i),\zeta _{\theta _{i}^{\zeta }}(\varvec{o}_i, {\mathbf {m}}), \xi _{\theta _{i}^{\xi }}(\varvec{o}_i, {\mathbf {m}})) \end{aligned}$$
(8)

in which \(\theta _{i} = \{ \theta _i^a, \theta _i^e, \theta _i^\zeta , \theta _i^\xi \}\) contains all the relevant parameters.
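Putting the pieces together, one possible actor for agent i (Eq. 8) composes the three sketches above with a 256-unit action head (Sect. 4.2). The tanh output assumes continuous actions and is only one plausible choice; discrete actions would instead go through a Gumbel-Softmax layer, as mentioned in Sect. 4.2.

```python
import torch
import torch.nn as nn

class MDActor(nn.Module):
    """Sketch of the MD-MADDPG actor in Eq. 8, reusing the modules sketched above."""

    def __init__(self, obs_dim: int, act_dim: int, mem_dim: int = 200):
        super().__init__()
        self.enc = ObservationEncoder(obs_dim, embed_dim=mem_dim)
        self.reader = MemoryReader(mem_dim, mem_dim, mem_dim)
        self.writer = MemoryWriter(mem_dim, mem_dim)
        self.act = nn.Sequential(                       # phi^act: (e_i, r_i, m') -> a_i
            nn.Linear(3 * mem_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),
        )

    def forward(self, obs: torch.Tensor, m: torch.Tensor):
        e = self.enc(obs)                               # Eq. 1
        r = self.reader(e, m)                           # Eqs. 2-4
        m_new = self.writer(e, m)                       # Eqs. 5-6
        a = self.act(torch.cat([e, r, m_new], dim=-1))  # Eq. 7
        return a, m_new                                 # m_new is written back to M
```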

Learning algorithm All the agent-specific policy parameters, i.e. \(\theta _{i}\), are learned end-to-end. We adopt an actor-critic model within a CLDE framework (Foerster et al. 2016; Lowe et al. 2017). In the standard actor-critic model (Degris et al. 2012), we have an actor to select the actions and a critic to evaluate the actor’s moves and provide feedback. In DDPG (Silver et al. 2014; Lillicrap et al. 2015), neural networks are used to approximate both the actor, represented by the policy function \(\varvec{\mu }_{\omega _i}\), and its corresponding critic, represented by an action-value function \( Q^{\varvec{\mu }_{\omega _i}}: {\mathcal {O}}_i \times {\mathcal {A}}_i \mapsto {\mathbb {R}} \), in order to maximize the objective function \(J(\omega _i) = {\mathbb {E}} [R_i]\). This is done by adjusting the parameters \(\omega _i\) in the direction of the gradient of \(J(\omega _i)\), which can be written as:

$$\begin{aligned} \nabla _{\omega _i} J(\omega _i) = {\mathbb {E}}_{\varvec{o}_i \sim {\mathcal {D}}} \big [ \nabla _{\omega _i} \varvec{\mu }_{\omega _i}(\varvec{o}_i) \nabla _{a_i} Q^{\varvec{\mu }_{\omega _i}}(\varvec{o}_i,a_i) |_{a_i=\varvec{\mu }_{\omega _i}(\varvec{o}_i)} \big ] \end{aligned}$$

The actions \(a_i\) produced by the actor \(\varvec{\mu }_{\omega _i}\) are evaluated by the critic \( Q^{\varvec{\mu }_{\omega _i}} \), which minimises the following loss:

$$\begin{aligned} {\mathcal {L}}(\omega _i) = {\mathbb {E}}_{\varvec{o}_i, a_i, r, \varvec{o}'_i \sim {\mathcal {D}}} \Big [(Q^{\varvec{\mu }_{\omega _i}}(\varvec{o}_i, a_i) - y)^2 \Big ] \end{aligned}$$

where \(\varvec{o}'_i\) is the next observation, \({\mathcal {D}}\) is an experience replay buffer which contains tuples \((\varvec{o}_i,\varvec{o}'_i,a_i,r_i)\), and \(y = r_i + \gamma Q^{\varvec{\mu '}_{\omega _i}}(\varvec{o}'_i, a'_i)\) represents the target Q-value. \(Q^{\varvec{\mu '}_{\omega _i}}\) is a target network whose parameters are periodically updated with the current parameters of \(Q^{\varvec{\mu }_{\omega _i}}\) to make training more stable. \({\mathcal {L}}(\omega _i)\) minimises the expectation of the difference between the current and the target action-state function.

In this formulation, as there is no interaction between agents, the policies are learned independently. We adopt the CLDE paradigm by letting the critics \(Q^{\varvec{\mu }_{\omega _i}}\) use the observations \( {\mathbf {x}} = (\varvec{o}_1, \varvec{o}_2, \dots , \varvec{o}_N)\) and the actions of all agents, hence:

$$\begin{aligned} \nabla _{\omega _i}J(\varvec{\mu }_{\omega _i}) = {\mathbb {E}}_{{\mathbf {x}}, a \sim {\mathcal {D}}} \Big [\nabla _{\omega _i}\varvec{\mu }_{\omega _i}(\varvec{o}_i) \nabla _{a_i}Q^{\varvec{\mu }_{\omega _i}}({\mathbf {x}},a_1, a_2, \dots , a_N)|_{a_i=\varvec{\mu }_{\omega _i}(\varvec{o}_i)} \Big ] \end{aligned}$$
(9)

where \( {\mathcal {D}} \) contains transitions in the form of \( ( {\mathbf {x}}, {\mathbf {x}}', a_1, a_2, \dots , a_N, r_1, \dots , r_N )\) and \( \mathbf {x'} = (\varvec{o}'_1, \varvec{o}'_2, \dots , \varvec{o}'_N) \) contains the next observations of all agents. Accordingly, \( Q^{\varvec{\mu }_{\omega _i}} \) is updated as

$$\begin{aligned} \begin{aligned} {\mathcal {L}}(\omega _i) = {}&{\mathbb {E}}_{{\mathbf {x}}, a, r, \mathbf {x'} \sim {\mathcal {D}}} \Big [(Q^{\varvec{\mu }_{\omega _i}}({\mathbf {x}}, a_1, a_2, \dots , a_N) - y)^2 \Big ], \\ y = {}&r_i + \gamma Q^{\varvec{\mu '}_{\omega _i}}(\mathbf {x'}, a'_1, a'_2, \dots , a'_N) \end{aligned} \end{aligned}$$
(10)

in which \( a'_1, a'_2, \dots , a'_N \) are the next actions of all agents. By minimising the loss in Eq. 10, the model attempts to improve the estimate of the critic \(Q^{\varvec{\mu }_{\omega _i}}\), which is in turn used to improve the policy itself through Eq. 9. Since the input of the policy described in Eq. 8 is \((\varvec{o}_i, {\mathbf {m}})\), the gradient of the resulting algorithm to maximize \(J(\theta _i) = {\mathbb {E}} [R_i]\) can be written as:

$$\begin{aligned} \nabla _{\theta _i}J(\varvec{\mu }_{\theta _i}) ={\mathbb {E}}_{{\mathbf {x}}, a, {\mathbf {m}} \sim {\mathcal {D}}} \Big [\nabla _{\theta _i}\varvec{\mu }_{\theta _i}(\varvec{o}_i, {\mathbf {m}}) \nabla _{a_i}Q^{\varvec{\mu }_{\theta _i}}({\mathbf {x}}, a_1, \dots , a_N)|_{a_i=\varvec{\mu }_{\theta _i}(\varvec{o}_i, {\mathbf {m}})} \Big ] \end{aligned}$$

where \( {\mathcal {D}} \) is a replay buffer which contains transitions in the form of \( ( {\mathbf {x}}, {\mathbf {x}}', a_1, \dots , a_N, {\mathbf {m}}, r_1, \dots , r_N )\). The \( Q^{\varvec{\mu }_{\theta _i}} \) function is updated according to Eq. 10. Algorithm 1 provides the pseudo-code of the resulting algorithm, which we call MD-MADDPG (Memory-driven MADDPG).
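For concreteness, a minimal sketch of one MD-MADDPG update for agent i is given below, following Eqs. 9–10. Target networks are assumed to exist but their soft updates are omitted, and feeding the stored memory \({\mathbf {m}}\) to the target actors is a simplifying assumption of this sketch rather than a detail taken from Algorithm 1.

```python
import torch
import torch.nn.functional as F

def md_maddpg_update(i, batch, actors, critics, target_actors, target_critics,
                     actor_opts, critic_opts, gamma=0.95):
    # batch: lists of per-agent tensors plus the shared memory stored in the replay buffer
    obs, next_obs, actions, mem, rewards = batch

    # Critic update (Eq. 10): regress Q towards the one-step target y.
    with torch.no_grad():
        next_actions = [target_actors[j](next_obs[j], mem)[0] for j in range(len(actors))]
        y = rewards[i] + gamma * target_critics[i](torch.cat(next_obs + next_actions, dim=-1))
    q = critics[i](torch.cat(obs + actions, dim=-1))
    critic_loss = F.mse_loss(q, y)
    critic_opts[i].zero_grad()
    critic_loss.backward()
    critic_opts[i].step()

    # Actor update (Eq. 9): ascend the critic's value of agent i's own action,
    # keeping the other agents' buffered actions fixed.
    a_i, _ = actors[i](obs[i], mem)
    joint = [a_i if j == i else actions[j] for j in range(len(actors))]
    actor_loss = -critics[i](torch.cat(obs + joint, dim=-1)).mean()
    actor_opts[i].zero_grad()
    actor_loss.backward()
    actor_opts[i].step()
```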

3.3 MD-MADDPG decentralised execution

During execution, only the learned actors \(\varvec{\mu }_{\theta _1},\varvec{\mu }_{\theta _2}, \dots , \varvec{\mu }_{\theta _N} \) are used to make decisions and select actions. An action is taken in turn by a single agent. The current agent receives its private observations, \(\varvec{o}_i\), reads \({\mathcal {M}}\) to extract \({\mathbf {r}}_i\) (Eq. 3), generates the new version of \({\mathbf {m}}\) (Eq. 5), stores it into \({\mathcal {M}}\) and selects its action \(a_i\) using \(\varvec{\mu }_{\theta _i}\). The policy of the next agent is then driven by the updated memory.
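A sketch of this decentralised execution loop is shown below; the environment API (`env.reset`, `env.step`) is illustrative, and the memory is assumed to be reset to zeros at the start of each episode.

```python
import torch

def run_episode(env, actors, mem_dim=200, horizon=100):
    obs = env.reset()                          # list of per-agent observation tensors (assumed)
    m = torch.zeros(1, mem_dim)                # shared memory at the start of an episode
    for _ in range(horizon):
        actions = []
        for i, actor in enumerate(actors):     # agents access the memory sequentially
            with torch.no_grad():
                a_i, m = actor(obs[i].unsqueeze(0), m)   # read, write, then act
            actions.append(a_i.squeeze(0))
        obs, rewards, done, _ = env.step(actions)
        if done:
            break
```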

Algorithm 1 MD-MADDPG (pseudo-code)

4 Experimental settings and results

4.1 Environments

In this section, we present a battery of six two-dimensional navigation environments (Fig. 2), with continuous space and discrete time. We introduce tasks of increasing complexity, requiring progressively more elaborate coordination skills: five environments are inspired by the Cooperative Navigation problem from the multi-agent particle environment (Lowe et al. 2017; Mordatch and Abbeel 2017), in addition to Waterworld from the SISL suite (Gupta et al. 2017). We focus on two-agent systems to keep the settings sufficiently simple and attempt an initial analysis and interpretation of emerging communication behaviours. A short description of the six environments follows.

Cooperative navigation (CN) This environment consists of N agents and N corresponding landmarks. An agent’s task is to occupy one of the landmarks whilst avoiding collisions with other agents. Every agent observes the distances to all other agents and the landmark positions.

Partial observable cooperative navigation (PO CN) This is based on Cooperative Navigation, i.e. the task and action space are the same, but the agents now have a limited vision range and can only observe a portion of the environment around them within a pre-defined radius.

Synchronous cooperative navigation (Sync CN) The agents need to occupy the landmarks exactly at the same time in order to be positively rewarded. A landmark is declared as occupied when an agent is arbitrarily close to it. Agents are penalised when the landmarks are not occupied at the same time.

Sequential cooperative navigation (Sequential CN) This environment is similar to the previous one, but the agents here need to occupy the landmarks sequentially and avoid reaching them simultaneously in order to be positively rewarded. Occupying the landmarks at the same time is penalised.

Swapping cooperative navigation (Swapping CN) In this case the task is more complex as it consists of two sub-tasks. Initially, the agents need to reach the landmarks and occupy them at the same time. Then, they need to swap their landmarks and repeat the same process.

Waterworld In this environment, two agents with limited range vision have to collaboratively capture food targets whilst avoiding poison targets. A food target can be captured only if both agents reach it at the same time. Additional details are reported in Gupta et al. (2017).

Fig. 2

An illustration of our environments. Blue circles represent the agents; dashed lines indicate the range of vision; green and red circles represent the food and poison targets, respectively, while black dots represent landmarks to be reached (Color figure online)

4.2 Implementation details

In all our experiments, we use a neural network with one layer (512 units) for the encoding (Eq. 1), a neural network with one layer (256 units) for the action selector (Eq. 7) and neural networks with three hidden layers (1024, 512, 256 units, respectively) for the critics. For MADDPG and MA-MADDPG, the actors are implemented with neural networks with two hidden layers (512, 256 units). The size of \({\mathbf {m}}\) is fixed to 200; this value has been empirically found to be optimal given the network architectures (Section B.7 provides a validation study on the choice of memory size). Consequently, the size of \({\mathbf {h}}_i\) and \({\mathbf {e}}_i\) is set to 200. We use the Adam optimizer (Kingma and Ba 2014) with a learning rate of \( 10^{-3} \) for the critics and \( 10^{-4} \) for the policies. The reward discount factor is set to 0.95, the size of the replay buffer to \(10^{6}\) and the batch size to 1024. The number of time steps per episode is set to 1000 for Waterworld and 100 for the other environments. We update the network parameters after every 100 samples added to the replay buffer, using soft updates with \( \tau = 0.01\). We train all the models over 60,000 episodes of 100 time-steps each on all the environments, except for Waterworld, for which we use 20,000 episodes of 1000 time-steps each. For exploration we use the Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein 1930) with \( \theta = 0.15 \) and \( \sigma = 0.3 \), a stochastic process which, over time, tends to drift towards its mean. It is commonly employed within DDPG (Lillicrap et al. 2015) to introduce temporally correlated noise, thereby avoiding the averaging effect of uncorrelated random signals, which would lead to less effective exploration. Discrete actions are supported by the Gumbel-Softmax, a biased, low-variance gradient estimator (Jang et al. 2016) typically used within the back-propagation algorithm in the presence of categorical variables. We use Python 3.5.4 (Van Rossum and Drake Jr 1995) with PyTorch v0.3.0 (Paszke et al. 2017) as our automatic differentiation and machine learning framework. All the computations were performed on an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz and a GeForce GTX TITAN X GPU.
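For reference, the hyperparameters listed above can be collected into a single configuration sketch; the values are those reported in this section, while the dictionary keys are illustrative.

```python
MD_MADDPG_CONFIG = {
    "encoder_units": 512,                 # Eq. 1
    "action_selector_units": 256,         # Eq. 7
    "critic_units": (1024, 512, 256),
    "baseline_actor_units": (512, 256),   # MADDPG / MA-MADDPG actors
    "memory_size_M": 200,                 # also the size of e_i and h_i
    "lr_critic": 1e-3,
    "lr_actor": 1e-4,
    "gamma": 0.95,
    "replay_buffer_size": int(1e6),
    "batch_size": 1024,
    "steps_per_episode": {"waterworld": 1000, "default": 100},
    "train_episodes": {"waterworld": 20000, "default": 60000},
    "soft_update_tau": 0.01,
    "ou_noise": {"theta": 0.15, "sigma": 0.3},
}
```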

4.3 Experimental results

In our experiments, we compared the proposed MD-MADDPG against four algorithms: MADDPG (Lowe et al. 2017), Meta-agent MADDPG (MA-MADDPG), CommNet (Sukhbaatar et al. 2016) and MAAC (Iqbal and Sha 2019). MA-MADDPG is a variation of MADDPG in which the policy of an agent, during both training and execution, is conditioned upon the observations of all the other agents in order to overcome difficulties due to partial observability. These methods have been selected to provide fair comparisons since they offer different learning approaches to multi-agent problems. MADDPG is what our method builds on, so this comparison can quantify the improvements brought by the proposed communication mechanism; MA-MADDPG offers an alternative information sharing mechanism; CommNet implements an explicit form of communication; MAAC is a recent state-of-the-art method in which critics select the information to share through an attention mechanism. We analyse the performance of these competing learning algorithms on all the six environments described in Sect. 4.1. In each case, after training, we evaluate an algorithm’s performance by collecting samples from an additional 1000 episodes, which are then used to extract different performance metrics: the reward quantifies how well a task has been solved; the distance from landmarks captures how closely an agent has reached the landmarks; the number of collisions counts how many times an agent has failed to avoid collisions with others; sync occupations counts how many times the landmarks have been occupied simultaneously and, analogously, not sync occupations counts how many times only one of the two landmarks has been occupied. For Waterworld, we also count the number of food targets and the number of poison targets. Since this environment requires continuous actions, we cannot use MAAC, as this method only operates on discrete action spaces. In Table 1, for each metric, we report the sample average and standard deviation obtained by each algorithm on each environment. A visualization of all results through box plots can be found in Section B.5 of the Supplementary Material.

Table 1 Comparison of MADDPG, MA-MADDPG, CommNet, MAAC and MD-MADDPG on six environments ordered by increasing level of difficulty, from CN to Waterworld. The sample mean and standard deviation over 1000 episodes are reported for each metric

All algorithms perform very similarly in the Cooperative Navigation and Partial Observable Navigation cases. This result is expected because these environments involve relatively simple tasks that can be completed even without explicit message-passing and information sharing functionalities. Despite communication not being essential, MD-MADDPG reaches performance comparable to MADDPG and MA-MADDPG. In the Synchronous Cooperative Navigation case, the ability of MA-MADDPG to overcome partial observability issues by sharing the observations across agents seems to be crucial, as the total rewards achieved by this algorithm are substantially higher than those obtained by both MADDPG and MD-MADDPG. In this case, whilst not achieving the highest reward, MD-MADDPG keeps the number of unsynchronised occupations at the lowest level, and also performs better than MADDPG on all three metrics. It would appear that in this case pooling all the private observations together is sufficient for the agents to synchronize their paths leading to the landmarks.

When moving on to more complex tasks requiring further coordination, the performances of the three algorithms diverge further in favour of MD-MADDPG. The requirement for strong collaborative behaviour is more evident in the Sequential Cooperative Navigation problem, as the agents need to explicitly learn to take either shorter or longer paths from their initial positions to the landmarks in order to occupy them in sequential order. Furthermore, according to the results in Table 1, the average distance travelled by the agents trained with MD-MADDPG is less than half the distance travelled by agents trained with MADDPG, indicating that these agents were able to find a better strategy by developing an appropriate communication protocol. Similarly, in the Swapping Cooperative Navigation scenario, MD-MADDPG achieves superior performance, and is again able to discover solutions involving the shortest paths. Waterworld is significantly more challenging as it requires a sustained level of synchronization throughout the entire episode and can be seen as a sequence of sub-tasks whereby each time the agents must reach a new food target whilst avoiding poison targets. In Table 1, it can be noticed that MD-MADDPG significantly outperforms both competitors in this case. The importance of sharing observations with other agents can also be seen here, as MA-MADDPG generates good policies that avoid poison targets, yet its average reward is substantially lower than the one scored by MD-MADDPG. The experimental settings so far have involved two agents. In addition, we have also investigated settings with a higher number of agents; see the Supplementary Material (Section B.2 for Cooperative Navigation and Section B.3 for Partial Observable Cooperative Navigation). These results show that the proposed method can be successfully used on larger systems without incurring any numerical complications or convergence difficulties. When compared to the other algorithms, MD-MADDPG achieves superior performance on Cooperative Navigation with respect to the reward metric. On Partially Observable Cooperative Navigation, there is no definite winner; nevertheless, MD-MADDPG shows competitive performance, for example outperforming all the baselines in the five-agent scenario.

In Section B.4 of the Supplementary Material, we provide an ablation study showing that the main components of MD-MADDPG are needed for its correct behaviour. We investigate the effects of removing each of the key components, i.e. the context vector and the read and write modules. Removing the context vector reduces the quality of the performance obtained on CN and on environments which require greater coordination efforts, like Sequential CN, Swapping CN and Waterworld. On PO CN no significant differences in performance are observed, while on Synchronous CN not sync occupations worsen (by approximately a factor of five) and sync occupations improve (by approximately a factor of two). This result is explained by the fact that in Sync CN, good strategies that do not involve explicit communication can be learnt to achieve good performance on sync occupations. The best performing method overall on this scenario is MA-MADDPG (see Table 1). This comparative method implements an implicit form of communication, equivalent to simple information sharing, which can be very effective in overcoming the partial observability issue that is the main challenge in Sync CN. We have observed that without the writing or reading components the performance worsened on all the experiments we ran.

Fig. 3

Visualisation of communications strategies learned by the agents in four different environments: the three principal components provide orthogonal descriptors of the memory content written by the agents and are being plotted as a function of time. Within each component, the highest values are in red, and the lowest values are in blue. The bar at the bottom of each figure indicates which phase (or sub-task) was being executed within an episode; see Sect. 4.4 for further details. The memory usage patterns learned by the agents are correlated with the underlying phases and the memory is no longer utilised once a task is about to be completed (Color figure online)

4.4 Communication analysis

In this section, we explore the dynamic patterns of communication activity that emerged in the environments presented in the previous section, and look at how the agents use the shared memory throughout an episode while solving the required task. For each environment, after training, we executed episodes with time horizon T and stored the write vector \(\mathbf {m'}\) of each agent at every time step t. Exploring how \(\mathbf {m'}\) evolves within an episode can shed some light onto the role of the memory device at each phase of the task. The analysis presented in this section focuses on the write vector, as we expect it to be more strongly correlated with the environment dynamics than the other components. The content of the write vector corresponds to the content of the communication channel itself, and is expected to contain information related to the task (e.g. changes in the current environment, the agent’s strategy or observed points of interest). A communication analysis with respect to the read vector \({\mathbf {r}}_i\) is presented in the Supplementary Material (Section B.8). The content of the read vector is an implicit representation internal to the agent itself, which serves to interpret the content of the channel and, at the same time, is utilised in the generation of \(\mathbf {m'}\). In order to produce meaningful visualisations, we first projected the dimensions of \(\mathbf {m'}\) onto the directions maximising the sample variance (i.e. the variance of the observed \(\mathbf {m'}\) across simulated episodes) using linear PCA.
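A minimal sketch of this projection step is given below, using scikit-learn's PCA as one possible implementation (the paper does not specify the library); the write vectors of one agent are assumed to have been logged as a (timesteps × M) array.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_write_vectors(writes: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project logged write vectors m' onto their first principal components."""
    pcs = PCA(n_components=n_components).fit_transform(writes)   # (timesteps, n_components)
    # Rescale each component to [0, 1] before plotting it on the colour map of Fig. 3.
    pcs = (pcs - pcs.min(axis=0)) / (pcs.max(axis=0) - pcs.min(axis=0) + 1e-8)
    return pcs
```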

Figure 3 shows the principal components (PCs) associated with the two agents over time for four of our six simulation environments. Only the first three PCs were retained, as these were found to cumulatively explain over \(80\%\) of the variance in all cases. The values of each PC were standardised to lie in [0, 1], so that they share the same range for fair comparison, and are plotted on a color map: one is in red and zero in blue. The timeline at the bottom of each figure indicates which specific phase of an episode is being executed at any given time point, and each consecutive phase is coloured using a different shade of grey. For instance, in Sequential Cooperative Navigation, a single landmark is reached and occupied in each phase. In Swapping Cooperative Navigation, during the first phase the agents search and find the landmarks; in the second phase they swap targets, and in the third phase they complete the task by reaching the landmarks again. In Synchronous Cooperative Navigation the phase indicates whether none of the landmarks is occupied (light-grey), just one is occupied (dark-grey) or both are occupied (black). Usually, in the last phase, the agents learn to stay close to their targets. This analysis pointed out that in the final phases, when tasks are already completed and there is no need for coordination, the PCs representing the communication activities assume lower (blue) values, while during the earlier phases, when tasks are still to be solved and cooperation is more strongly required, they assume higher (red) values. This led us to interpret the higher values as being indicative of high memory usage, and lower values as being associated with low activity. In most cases, high communication activity is maintained while the agents are actively working on and completing a task, whereas during the final phases (where typically there is no exploration because the task is considered completed) low activity levels are more predominant.

This analysis also highlights the fact that the communication channel is used differently in each environment. In some cases, the levels of activity alternate between agents. For instance, in Sequential Cooperative Navigation (Fig. 3a), high levels of memory usage by one agent are associated with low ones by the other. A different behaviour is observed in the other environments: in the Swapping Cooperative Navigation task both agents produce either high or low activation values at the same time, whereas in Synchronous Cooperative Navigation the memory activity is very intense before phase three, while the agents are collaborating to complete the task. The dynamics characterizing the memory usage also change based on the particular phase reached within an episode. For example, in Fig. 3a, during the first two phases the agents typically show alternating activity levels, whilst in the third phase both agents significantly decrease their memory activity as the task has already been solved and there are no more changes in the environment. Figure 3 provides some evidence that, in some cases, a peer-to-peer communication strategy is likely to emerge instead of a master-slave one where one agent takes complete control of the shared channel. The scenario is significantly more complex in Waterworld, where the changes in memory usage appear at a much higher frequency due to the presence of many sequential sub-tasks. Here, each light-grey phase indicates that a food target has been captured. Peaks of memory activity seem to follow those events as the agents reassess their situation and require higher coordination to jointly decide what the next target is going to be. In the Supplementary Material (B.1) we provide further experimental results showing the importance of the communication by corrupting the memory content at execution time, which further corroborate the role of the exchanged messages in improving agents’ coordination.

5 Conclusions

In this work, we have introduced MD-MADDPG, a multi-agent reinforcement learning framework that uses a shared memory device as an inter-agent communication channel to improve coordination skills. The memory content contains a learned representation of the environment that is used to better inform the individual policies. The memory device is learnable end-to-end without particular constraints other than its size, and each agent develops the ability to modify and interpret it. We empirically demonstrated that this approach leads to better performance in small-scale (up to six agents in our experiments) cooperative tasks where coordination and synchronization are crucial to successful completion of the task and where world visibility is very limited. Furthermore, we have visualised and analysed the dynamics of the communication patterns that have emerged in several environments. This exploration has indicated that, as expected, the agents have learned different communication protocols depending upon the complexity of the task. In this study we have mostly focused on two-agent systems to keep the settings sufficiently simple to understand the role of the memory. Very competitive results have been obtained when more agents are used.

In future work, we plan to study the role played by the sequential order in which the memory is updated as the number of agents grows. A possible approach may consist of deploying agent selection mechanisms, possibly based on attention, so that only a relevant subset of agents can modify the memory at any given time, or of imposing master-slave architectures. A possible solution would be to have an agent acting as a “scheduler” that controls the access to the memory, decides which information can be shared and provides scheduling for the write accesses. Introducing such a scheduling agent would allow the current framework to be kept unaltered, e.g. the sequential access to the memory would be retained. Although the scheduling agent would add an additional layer of complexity, it might reduce the number of memory accesses required in larger scale systems and improve the overall scalability. In future work, we will also apply MD-MADDPG to environments characterized by more structured and high-dimensional observations (e.g. pixel data), where collectively learning to represent the environment through a shared memory should be particularly beneficial.