1 Introduction

Reinforcement Learning (RL) allows agents to learn how to map observations to actions through feedback reward signals (Sutton and Barto 1998). Recently, deep neural networks (LeCun et al. 2015; Schmidhuber 2015) have had a noticeable impact on RL (Li 2017). They provide flexible models for learning value functions and policies, help overcome difficulties related to large state spaces, and eliminate the need for hand-crafted features and ad-hoc heuristics (Cortes et al. 2002; Parker et al. 2003; Olfati-Saber et al. 2007). Deep reinforcement learning (DRL) algorithms, which usually rely on deep neural networks to approximate functions, have been successfully employed in single-agent systems, including video game playing (Mnih et al. 2015), robot locomotion (Lillicrap et al. 2015), object localisation (Caicedo and Lazebnik 2015) and data-center cooling (Evans and Gao 2016).

Following the uptake of DRL in single-agent domains, there is now a need to develop improved learning algorithms for multi-agent (MA) systems where additional challenges arise. Markov Decision Processes, upon which DRL methods rely, assume that the reward distribution and dynamics are stationary (Hernandez-Leal et al. 2017). When multiple learners interact with each other, this property is violated because the reward that an agent receives also depends on other agents’ actions (Laurent et al. 2011). This issue, known as the moving-target problem (Tuyls and Weiss 2012), removes convergence guarantees and introduces additional learning instabilities. Further difficulties arise from environments characterized by partial observability (Singh et al. 1994; Chu and Ye 2017; Peshkin et al. 2000) whereby the agents do not have full access to the world state, and where coordination skills are essential.

An important challenge in multi-agent DRL is how to facilitate communication amongst interacting agents. Communication is widely known to play a critical role in promoting coordination between humans (Számadó 2010). Humans have been proven to excel at communicating even in the absence of a conventional code (De Ruiter et al. 2010). When coordination is required and no common language exists, simple communication protocols are likely to emerge (Selten and Warglien 2007). Human communication involves more than sending and receiving messages: it requires specialized interactive intelligence in which receivers have the ability to recognize intentions and senders can properly design messages (Wharton 2003). The emergence of communication has been widely investigated (Garrod et al. 2010; Theisen et al. 2010); for example, new signs and symbols can emerge to represent real-world concepts. Fusaroli et al. (2012) demonstrated that language can be seen as a social coordination device learnt through reciprocal interaction with the environment for optimizing coordinative dynamics. The relation between communication and coordination has been widely discussed (Vorobeychik et al. 2017; Demichelis and Weibull 2008; Miller and Moser 2004; Kearns 2012). Communication is an essential skill in many tasks: for instance, emergency response organizations depend on it to manage critical and urgent situations properly (Comfort 2007). In multiplayer video games, reaching a sufficiently high level of coordination is often essential to succeed (Chen 2009).

Two-agent systems have often been studied when looking at the effects of communication on coordination. Galantucci (2005) showed that humans can easily produce new protocols to overcome the lack of a common language, through experiments in which pairs of participants playing video games were allowed to send messages through a medium that prevented the use of standard symbols. In two-player games, like the Battle of the Sexes, allowing players to exchange messages resulted in improved coordination (Cooper et al. 1989). Human conversations can be interpreted as a bi-directional form of communication, where the same entity can both send and receive messages (Lasswell 1948). This kind of communication can be efficiently explored in small-scale systems through coordination games (Cooper et al. 1992) and is often the key to success in real-world scenarios such as bargaining with incomplete information (Brosig et al. 2003).

Analogously, the importance of communication has been recognised when designing artificial MA learning systems, especially in tasks requiring synchronization (Scardovi and Sepulchre 2008; Wen et al. 2012). For example, in navigation tasks, agents can localise each other more easily through shared information (Fox et al. 2000). In group strategy coordination, such as automating negotiations, communication is fundamental to improve the final outcome (Wunder et al. 2009; Itō et al. 2011). A wide range of MA applications have benefitted from inter-agent message passing including distributed smart grid control (Pipattanasomporn et al. 2009), consensus in networks (You and Xie 2011), multi-robot control (Ren and Sorensen 2008), autonomous vehicle driving (Petrillo et al. 2018), elevator control (Crites and Barto 1998), soccer-playing robots (Stone and Veloso 1998) and language learning in two-agent systems (Lazaridou et al. 2016).

Recently, Lowe et al. (2017) have proposed MADDPG (Multi-Agent Deep Deterministic Policy Gradient). Their approach extends the actor-critic algorithm (Degris et al. 2012) in which each agent has an actor to select actions and a critic to evaluate them. MADDPG embraces the centralised learning and decentralised execution paradigm (CLDE) (Foerster et al. 2016; Kraemer and Banerjee 2016; Oliehoek and Vlassis 2007). During centralised training, the critics receive observations and actions from all the agents whilst the actors only see their local observations. On the other hand, the execution only relies on actors. This approach has been designed to address the emergence of environment non-stationarity (Tuyls and Weiss 2012; Laurent et al. 2011) and has been shown to perform well in a number of mixed competitive and cooperative environments. In MADDPG, the agents can only share each other’s actions and observations during training through their critics, but do not have the means to develop an explicit form of communication through their experiences. The input size of each critic increases linearly with the number of agents (Lowe et al. 2017), which hinders its scalability (Jiang and Lu 2018).

In this article, we consider tasks requiring strong coordination and synchronization skills. In order to thoroughly study the effects of communication on these scenarios, we focus on small-scale systems. This allows us to design tasks with an increasing level of complexity, and simplifies the investigation of possible correlations between the level of messages being exchanged and any environmental changes. We provide empirical evidence that the proposed method reaches very good performance on a range of two-agent scenarios when a high level of cooperation is required. In addition, we present experimental results for systems with up to six agents in the Supplementary Material (Sections B.2 and B.3). In the real world, there is a range of applications involving the coordination of only a few actors, for example motor interactions like sawing or cooperative lifting of heavy loads (Jarrassé et al. 2012).

In such cases, being able to communicate information beyond the private observations, and infer a shared representation of the world through interactions, becomes essential. Ideally, an agent should be able to remember its current and past experience generated when interacting with the environment, learn how to compactly represent these experiences in an appropriate manner, and share this information for others to benefit from. Analogously, an agent should be able to learn how to decode the information generated by other agents and leverage it under every environmental state. The above requirements are captured here by introducing a communication mechanism facilitating information sharing within the CLDE paradigm. Specifically, we provide the agents with a shared communication device that can be used to learn from their collective private observations and share relevant messages with others. Each agent also learns how to decode the memory content in order to improve its own policy. Both the read and write operations are implemented as parametrised, non-linear gating mechanisms that are learned concurrently with the individual policies. When the underlying task to be solved demands complex coordination skills, we demonstrate that our approach can achieve higher performance compared to the MADDPG baseline in small-scale systems. Furthermore, we demonstrate that being able to learn a communication protocol end-to-end, jointly with the policies, can also improve upon a meta-agent approach whereby all the agents perfectly share all their observations and actions in both training and execution. We investigate a potential interpretation of the communication patterns that have emerged when training two-agent systems through time-varying low-dimensional projections and their visual assessment, and demonstrate how these patterns correlate with the underlying tasks being learned.

This article is organised as follows. In Sect. 2 a general overview of related work is offered to characterize state-of-the-art approaches for MARL, with a special focus on communication systems. Section 3 contains the formalization of the problem setup, the details of the proposed method and the description of the learning process. All the experiments are reported in Sect. 4, where results are presented in terms of numerical metrics to evaluate the performance achieved on six different scenarios; an analysis of the communication channel is also provided to offer qualitative insights into the exchanged messages. Concluding comments are given in Sect. 5. In the Supplementary Material, Section A describes details of MA-MADDPG, a comparative method, and Section B presents a range of additional experiments to further investigate: the effects of memory corruption; changes in performance when increasing the number of agents; an ablation study to validate the components used in the proposed method; box plots with the main results; an assessment of the robustness of the method when changing the random seeds; and additional analyses of the communication channel.

2 Related work

The problem of reinforcement learning in cooperative environments has been studied extensively (Littman 1994; Schmidhuber 1996; Panait and Luke 2005; Matignon et al. 2007). Early attempts exploited single-agent techniques like Q-learning to train all agents independently (Tan 1993), but suffered from the excessive size of the state space resulting from having multiple agents. Subsequent improvements were obtained using variations of Q-learning (Ono and Fukumoto 1996; Guestrin et al. 2002) and distributed approaches (Lauer and Riedmiller 2000). More recently, DRL techniques like DQN (Mnih et al. 2013) have led to superior performance in single-agent settings by approximating policies through deep neural networks. Tampuu et al. (2017) have demonstrated that an extension of the DQN is able to train multiple agents independently to solve a popular two-agent system, the Pong game. Gupta et al. (2017) have analyzed the performance of popular DRL algorithms, including DQN, DDPG (Lillicrap et al. 2015), TRPO (Schulman et al. 2015) and actor-critic, on different MA environments, and have introduced a curriculum learning approach to increase scalability. Foerster et al. (2017) have suggested using a centralised critic for all agents that marginalises out a single agent's action while the other agents' actions are kept fixed. Iqbal and Sha (2019) proposed MAAC (Multi-Actor-Attention-Critic), a framework for learning decentralised policies with centralised critics, which selects relevant information for each agent at every time-step through an attention mechanism. In more recent work, a probabilistic recursive reasoning framework has been proposed to model behaviours in a two-agent context: each agent, through variational Bayes methods, approximates the other agent's policy to predict its strategy and then improve its own policy (Wen et al. 2019).

The role of communication in cooperative settings has also been explored, and different methods have been proposed differing in how the communication channels have been formulated using DRL techniques. Many approaches rely on implicit communication mechanisms whereby the weights of the neural networks used to implement policies or action-value functions are shared across agents or modelled to allow inter-agent information flow. For instance, in CommNet (Sukhbaatar et al. 2016), the policies are implemented through subsets of units of a large feed-forward neural network mapping the inputs of all agents to actions. At any given time step, the hidden states of each agent are used as messages, averaged and sent as input for the next layer. Singh et al. (2019) proposed IC3Net, a model designed to improve CommNet, where the hidden states of the agents are also used as messages, but this time they are averaged only after being weighted by a gating mechanism. In addition, in IC3Net, each agent is implemented through a Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber 1997) in order to consider the history of the seen observations. In BiCNet (Peng et al. 2017), the agents’ policies and value networks are connected through bidirectional neural networks, and trained using an actor-critic approach. Jiang and Lu (2018) proposed an attention mechanism that, when a need for communication emerges, selects which subsets of agents should communicate; the hidden states of their policy networks are integrated through an LSTM to generate a message that is used as input for the next layer of the policy network. Das et al. (2018) utilised a soft attention mechanism to allow the agents to select the recipients of their messages. Each agent, along with the message, broadcasts a signature which can be used to encode agent-specific information. Kong et al. (2017) introduced a master-slave architecture whereby a master agent provides high-level instructions to organize the slave agents in an attempt to achieve fine-grained optimality. Similarly, in Feudal Multiagent Hierarchies (Ahilan and Dayan 2019), an agent acts as “manager” and learns to communicate sub-goals to multiple workers operating simultaneously. A different approach is instead provided by the Bayesian Action Decoder (BAD) (Foerster et al. 2018), a technique for two-agent settings where an approximate Bayesian update is used to produce a public belief that directly conditions the actions of all agents.

In our work, we introduce a mechanism to generate explicit messages capturing relevant aspects of the world, which the agents are able to collectively learn using their observations and interactions. The messages are then sent and received to complement their private observations when making decisions. Some aspects of our approach are related to DIAL (Differentiable Inter-Agent Learning) (Foerster et al. 2016), in that the communication is enabled by differentiable channels that allow the gradient of the Q-function to provide appropriate feedback in small-scale scenarios. Like DIAL, we would like the agents to share explicit messages. However, whereas DIAL uses simple and pre-determined protocols, our agents are given the ability to infer complex protocols from experience, without necessarily relying on pre-defined ones, and utilise those to learn better policies. Explicit messages are also used in SchedNet (Kim et al. 2019) to investigate situations where the bandwidth is limited and only some of the agents are allowed to communicate. In their approach, which also focuses on small-scale scenarios to better capture the scheduling constraints, the agents produce messages by encoding their observations and a scheduler decides whether an agent is allowed to use a communication channel. A limited bandwidth channel is also used in our work, but all the agents have full access to the channel.

3 Memory-driven MADDPG

3.1 Problem setup

We consider a system with N interacting agents, where N is typically small, and adopt a multi-agent extension of partially observable Markov decision processes (Littman 1994). This formulation assumes a set, \( {\mathcal {S}}\), containing all the states characterising the environment; a sequence \(\{{\mathcal {A}}_1, {\mathcal {A}}_2, \dots , {\mathcal {A}}_N\}\) where each \({\mathcal {A}}_i\) is a set of possible actions for the \(i^{th}\) agent; a sequence \(\{{\mathcal {O}}_1, {\mathcal {O}}_2, \dots , {\mathcal {O}}_N\}\) where each \( {\mathcal {O}}_i\) contains the observations available to the \(i^{th}\) agent. Each \(\varvec{o}_i \in {\mathcal {O}}_i\) provides a partial characterisation of the current state and is private to that agent. Every action \(a_i \in {\mathcal {A}}_i\) is deterministically chosen according to a policy function, \( \varvec{\mu }_{\theta _i}: {\mathcal {O}}_i \mapsto {\mathcal {A}}_i \), parametrised by \(\theta _i\). The environment generates a next state according to a transition function, \( {\mathcal {T}}: {\mathcal {S}} \times {\mathcal {A}}_1 \times {\mathcal {A}}_2 \times \dots \times {\mathcal {A}}_N \mapsto {\mathcal {S}} \), that considers the current state and the N actions taken by the agents. The reward received by an agent, \( r_i : {\mathcal {S}} \times {\mathcal {A}}_1 \times {\mathcal {A}}_2 \times \dots \times {\mathcal {A}}_N \mapsto {\mathbb {R}}\), is a function of states and actions. Each agent learns a policy that maximises the expected discounted future rewards over a period of T time steps, \(J(\theta _i) = {\mathbb {E}} [R_i]\), where \(R_i = \sum _{t=0}^{T} \gamma ^t r_i(s^t_i,a^t_i)\) is the \(\gamma \)-discounted sum of future rewards. During training, we would like an agent to learn by using not only its own observations, but also a collectively learned representation of the world that accumulates through experiences coming from all the agents. At the same time, each agent should develop the ability to interpret this shared knowledge in its own unique way as needed to optimise its policy. Finally, the information sharing mechanism needs to be designed in such a way that it can be used in both training and execution.

3.2 Memory-driven communication

We introduce a shared communication mechanism enabling agents to establish a communication protocol through a memory device \( {\mathcal {M}} \) of pre-determined capacity M (Fig. 1). The device is designed to store a message \( {\mathbf {m}} \in {\mathbb {R}}^M \) which progressively captures the collective knowledge of the agents as they interact. An agent’s policy becomes \( \varvec{\mu }_{\theta _i}: {\mathcal {O}}_i \times {\mathcal {M}} \mapsto A_i \), i.e. it is dependent on the agent’s private observation as well as the collective memory. Before taking an action, each agent accesses the memory device to initially retrieve and interpret the message left by others. After reading the message, the agent performs a writing operation that updates the memory content. During training, these operations are learned without any a priori constraint on the nature of the messages other than the device’s size, M. During execution, the agents use the communication protocol that they have learned to read and write the memory over an entire episode. We aim to build a model trainable end-to-end only through reward signals, and use neural networks as function approximators for policies, and learnable gated functions as mechanisms to facilitate an agent’s interactions with the memory. The chosen parametrisations of these operations are presented and discussed below.

Fig. 1

The MD-MADDPG framework. During training and testing, each policy uses its observation and the content of the shared memory to produce a new action and then update the shared channel. Critics are used during training only and each one of them takes as input all the observations and actions

Encoding operation Upon receiving its private observations, each agent maps them on to an embedding representing the agent’s current vision of the state:

$$\begin{aligned} {\mathbf {e}}_i = \varphi _{\theta _{i}^e}^{enc}(\varvec{o}_i), {\mathbf {e}}_i \in {\mathbb {R}}^{E} \end{aligned}$$
(1)

where \( \varphi _{\theta _{i}^e}^{enc} \) is a neural network parametrised by \( \theta _{i}^e \). The embedding \( {\mathbf {e}}_i \) plays a fundamental role in selecting a new action and in the reading and writing phases.
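As an illustration, a minimal PyTorch sketch of this encoding step could look as follows, assuming a single hidden layer of 512 units and an embedding size \(E = 200\) as reported in Sect. 4.2; all names are illustrative and not taken from a released implementation.

```python
import torch
import torch.nn as nn

class ObservationEncoder(nn.Module):
    """Sketch of the encoder in Eq. 1: maps a private observation o_i to e_i."""

    def __init__(self, obs_dim: int, embed_dim: int = 200, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, obs_dim) -> e_i: (batch, embed_dim)
        return self.net(obs)
```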

Read operation After encoding the current information, the agent performs a read operation allowing it to extract and interpret relevant knowledge that has been previously captured in \({\mathcal {M}}\). By interpreting this information content, the agent has access to what other agents have learned. A context vector \({\mathbf {h}}_i\) is generated to capture spatio-temporal information previously encoded in \({\mathbf {e}}_i\) through a linear mapping,

$$\begin{aligned} {\mathbf {h}}_i = {\mathbf {W}}_i^{h}{\mathbf {e}}_i, {\mathbf {h}}_i \in {\mathbb {R}}^{H}, {\mathbf {W}}_i^{h} \in {\mathbb {R}}^{H \times E} \end{aligned}$$

where \( {\mathbf {W}}_i^{h} \) represents the learnable weights of the linear projection. While \({\mathbf {e}}_i\) is defined as a general observation embedding, \({\mathbf {h}}_i\) is specifically designed to extract features for the reading operation. The context vector \({\mathbf {h}}_i\) can be interpreted as an agent’s internal representation that uses the observation embedding \({\mathbf {e}}_i\) to extract information to be utilized by the gating mechanism only (Eq. 2); the output of this gate is then used to extract information from the memory. The main function of the context vector is to facilitate the emergence of an internal representation specifically designed for interpreting the memory content during the read phase. An ablation study aimed at investigating the added benefits introduced by \({\mathbf {h}}_i\) is provided in the Supplementary Material (B.4). This study supports our intuition that the context vector is crucial for the proper functioning of the entire framework on more complex environments. The agent observation embedding \({\mathbf {e}}_i\), the reading context vector \({\mathbf {h}}_i\) and the current memory \({\mathbf {m}}\) contain different types of information that are used jointly as inputs to learn a gating mechanism,

$$\begin{aligned} {\mathbf {k}}_i = \sigma ({\mathbf {W}}_i^{k}[{\mathbf {e}}_i,{\mathbf {h}}_i, {\mathbf {m}}]), {\mathbf {k}}_i \in [0,1]^{M},{\mathbf {W}}_i^{k} \in {\mathbb {R}}^{M \times (E + H + M)} \end{aligned}$$
(2)

where \( \sigma ( \cdot ) \) is the sigmoid function and \([{\mathbf {e}}_i,{\mathbf {h}}_i, {\mathbf {m}}]\) means that the three vectors are concatenated. The values of \( {\mathbf {k}}_i \) are used as weights to modulate the memory content and extract the information from it, i.e.

$$\begin{aligned} {\mathbf {r}}_i ={\mathbf {m}} \odot {\mathbf {k}}_i \end{aligned}$$
(3)

where \( \odot \) represents the Hadamard product. \({\mathbf {k}}_i\) takes values in [0, 1] and its role is to potentially downgrade the information stored in memory or even completely discard the current content. Learning agent-specific weights \({\mathbf {W}}_i^h\) and \({\mathbf {W}}_i^k \) means that each agent is able to interpret \({\mathbf {m}}\) in its own unique way. As the reading operation strongly depends on the current observation, the interpretation of \({\mathbf {m}}\) can change from time to time depending on what an agent sees during an episode. Given that \({\mathbf {r}}_i\) depends on \({\mathbf {m}}\) and \({\mathbf {e}}_i\) (from \(\varvec{o}_i\) in Eq. 1), we lump all the adjustable parameters into \(\theta _{i}^{\zeta } = \{ {\mathbf {W}}_i^h, {\mathbf {W}}_i^k \} \) and write

$$\begin{aligned} {\mathbf {r}}_i = \zeta _{\theta _{i}^{\zeta }}(\varvec{o}_i, {\mathbf {m}}). \end{aligned}$$
(4)
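A minimal sketch of the read operation (Eqs. 2–4) is given below, assuming \(E = H = M = 200\) as in Sect. 4.2; biases are omitted so that the learnable parameters mirror \({\mathbf {W}}_i^h\) and \({\mathbf {W}}_i^k\) in the text.

```python
import torch
import torch.nn as nn

class MemoryReader(nn.Module):
    """Sketch of the read gate: r_i = m * sigmoid(W_k [e_i, h_i, m])."""

    def __init__(self, embed_dim: int = 200, context_dim: int = 200, mem_dim: int = 200):
        super().__init__()
        self.W_h = nn.Linear(embed_dim, context_dim, bias=False)                     # h_i = W_i^h e_i
        self.W_k = nn.Linear(embed_dim + context_dim + mem_dim, mem_dim, bias=False) # gate over [e_i, h_i, m]

    def forward(self, e: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        h = self.W_h(e)                                            # context vector
        k = torch.sigmoid(self.W_k(torch.cat([e, h, m], dim=-1)))  # Eq. 2
        return m * k                                               # Eq. 3
```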

Write operation In the writing phase, an agent decides what information to share and how to properly update the content of the memory whilst taking into account the other agents. The write operation is loosely inspired by the LSTM (Hochreiter and Schmidhuber 1997) where the content of the memory is updated through gated functions regulating what information is kept and what is discarded. Initially, the agent generates a candidate memory content, \({\mathbf {c}}_i\), which depends on its own encoded observations and current shared memory through a non-linear mapping,

$$\begin{aligned} {\mathbf {c}}_i = \tanh ({\mathbf {W}}_i^{c}[{\mathbf {e}}_i,{\mathbf {m}}]), {\mathbf {c}}_i \in [-1,1]^{M},{\mathbf {W}}_i^{c} \in {\mathbb {R}}^{M \times (E + M)} \end{aligned}$$

where \({\mathbf {W}}_i^{c}\) are weights to learn. An input gate, \({\mathbf {g}}_i\), contains the values used to regulate the content of this candidate while a forget gate, \({\mathbf {f}}_i\), is used to decide what to keep and what to discard from the current \( {\mathbf {m}} \). These operations are described as follows:

$$\begin{aligned} \begin{aligned} {\mathbf {g}}_i = {}&\sigma ({\mathbf {W}}_i^{g}[{\mathbf {e}}_i,{\mathbf {m}}]), {\mathbf {g}}_i \in [0,1]^{M},{\mathbf {W}}_i^{g} \in {\mathbb {R}}^{M \times (E + M)}\\ {\mathbf {f}}_i = {}&\sigma ({\mathbf {W}}_i^{f}[{\mathbf {e}}_i,{\mathbf {m}}]), {\mathbf {f}}_i \in [0,1]^{M},{\mathbf {W}}_i^{f} \in {\mathbb {R}}^{M \times (E + M)}. \end{aligned} \end{aligned}$$

The \(i^{th}\) agent then finally generates an updated message as a weighted linear combination of old and new messages, as follows:

$$\begin{aligned} \mathbf {m'} = {\mathbf {g}}_i \odot {\mathbf {c}}_i + {\mathbf {f}}_i \odot {\mathbf {m}}. \end{aligned}$$
(5)

The update \({\mathbf {m}}'\) is stored in memory \({\mathcal {M}}\) and made accessible to the other agents. At each time step, the agents sequentially read and write the content of the memory using the above procedure. Since \(\mathbf {m'}\) depends on \({\mathbf {m}}\) and \({\mathbf {e}}_i\) (derived from \(\varvec{o}_i\) in Eq. 1), we collect all the parameters into \(\theta _{i}^{\xi } = \{ {\mathbf {W}}_i^c, {\mathbf {W}}_i^g, {\mathbf {W}}_i^f \}\) and express the write operation as:

$$\begin{aligned} \mathbf {m'} = \xi _{\theta _{i}^{\xi }}(\varvec{o}_i, {\mathbf {m}}). \end{aligned}$$
(6)
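Analogously, the write operation (Eq. 5) can be sketched as a small gated module; as above, \(E = M = 200\) is assumed and biases are omitted for parity with the equations.

```python
import torch
import torch.nn as nn

class MemoryWriter(nn.Module):
    """Sketch of the write gates: m' = g_i * c_i + f_i * m."""

    def __init__(self, embed_dim: int = 200, mem_dim: int = 200):
        super().__init__()
        self.W_c = nn.Linear(embed_dim + mem_dim, mem_dim, bias=False)  # candidate content c_i
        self.W_g = nn.Linear(embed_dim + mem_dim, mem_dim, bias=False)  # input gate g_i
        self.W_f = nn.Linear(embed_dim + mem_dim, mem_dim, bias=False)  # forget gate f_i

    def forward(self, e: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        x = torch.cat([e, m], dim=-1)
        c = torch.tanh(self.W_c(x))      # what could be written
        g = torch.sigmoid(self.W_g(x))   # how much of the candidate to write
        f = torch.sigmoid(self.W_f(x))   # how much of the old memory to keep
        return g * c + f * m             # Eq. 5
```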

Action selector Upon completing both read and write operations, the agent is able to take an action, \(a_i\), which depends on the current encoding of its observations, its own interpretation of the current memory content and its updated version, that is

$$\begin{aligned} a_i = \varphi _{\theta _{i}^a}^{act}({\mathbf {e}}_i,{\mathbf {r}}_i,\mathbf {m'}) \end{aligned}$$
(7)

where \( \varphi _{\theta _{i}^a}^{act} \) is a neural network parametrised by \({\theta _{i}^a}\). The resulting policy function can be written as a composition of functions:

$$\begin{aligned} \varvec{\mu }_{\theta _i}(\varvec{o}_i, {\mathbf {m}}) = \varphi _{\theta _{i}^a}^{act}(\varphi _{\theta _{i}^e}^{enc}(\varvec{o}_i),\zeta _{\theta _{i}^{\zeta }}(\varvec{o}_i, {\mathbf {m}}), \xi _{\theta _{i}^{\xi }}(\varvec{o}_i, {\mathbf {m}})) \end{aligned}$$
(8)

in which \(\theta _{i} = \{ \theta _i^a, \theta _i^e, \theta _i^\zeta , \theta _i^\xi \}\) contains all the relevant parameters.
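Putting the pieces together, one possible actor for agent i (Eq. 8) composes the three sketches above with a 256-unit action head (Sect. 4.2). The tanh output assumes continuous actions and is only one plausible choice; discrete actions would instead go through a Gumbel-Softmax layer, as mentioned in Sect. 4.2.

```python
import torch
import torch.nn as nn

class MDActor(nn.Module):
    """Sketch of the MD-MADDPG actor in Eq. 8, reusing the modules sketched above."""

    def __init__(self, obs_dim: int, act_dim: int, mem_dim: int = 200):
        super().__init__()
        self.enc = ObservationEncoder(obs_dim, embed_dim=mem_dim)
        self.reader = MemoryReader(mem_dim, mem_dim, mem_dim)
        self.writer = MemoryWriter(mem_dim, mem_dim)
        self.act = nn.Sequential(                       # phi^act: (e_i, r_i, m') -> a_i
            nn.Linear(3 * mem_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),
        )

    def forward(self, obs: torch.Tensor, m: torch.Tensor):
        e = self.enc(obs)                               # Eq. 1
        r = self.reader(e, m)                           # Eqs. 2-4
        m_new = self.writer(e, m)                       # Eqs. 5-6
        a = self.act(torch.cat([e, r, m_new], dim=-1))  # Eq. 7
        return a, m_new                                 # m_new is written back to M
```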

Learning algorithm All the agent-specific policy parameters, i.e. \(\theta _{i}\), are learned end-to-end. We adopt an actor-critic model within a CLDE framework (Foerster et al. 2016; Lowe et al. 2017). In the standard actor-critic model (Degris et al. 2012), we have an actor to select the actions and a critic to evaluate the actor’s moves and provide feedback. In DDPG (Silver et al. 2014; Lillicrap et al. 2015), neural networks are used to approximate both the actor, represented by the policy function \(\varvec{\mu }_{\omega _i}\), and its corresponding critic, represented by an action-value function \( Q^{\varvec{\mu }_{\omega _i}}: {\mathcal {O}}_i \times {\mathcal {A}}_i \mapsto {\mathbb {R}} \), in order to maximize the objective function \(J(\omega _i) = {\mathbb {E}} [R_i]\). This is done by adjusting the parameters \(\omega _i\) in the direction of the gradient of \(J(\omega _i)\), which can be written as:

$$\begin{aligned} \nabla _{\omega _i} J(\omega _i) = {\mathbb {E}}_{\varvec{o}_i \sim {\mathcal {D}}} \big [ \nabla _{\omega _i} \varvec{\mu }_{\omega _i}(\varvec{o}_i) \nabla _{a_i} Q^{\varvec{\mu }_{\omega _i}}(\varvec{o}_i,a_i) |_{a_i=\varvec{\mu }_{\omega _i}(\varvec{o}_i)} \big ] \end{aligned}$$

The actions \(a_i\) produced by the actor \(\varvec{\mu }_{\omega _i}\) are evaluated by the critic \( Q^{\varvec{\mu }_{\omega _i}} \), which minimises the following loss:

$$\begin{aligned} {\mathcal {L}}(\omega _i) = {\mathbb {E}}_{\varvec{o}_i, a_i, r, \varvec{o}'_i \sim {\mathcal {D}}} \Big [(Q^{\varvec{\mu }_{\omega _i}}(\varvec{o}_i, a_i) - y)^2 \Big ] \end{aligned}$$

where \(\varvec{o}'_i\) is the next observation, \({\mathcal {D}}\) is an experience replay buffer which contains tuples \((\varvec{o}_i,\varvec{o}'_i,a_i,r_i)\), and \(y = r_i + \gamma Q^{\varvec{\mu '}_{\omega _i}}(\varvec{o}'_i, a'_i)\) represents the target Q-value. \(Q^{\varvec{\mu '}_{\omega _i}}\) is a target network whose parameters are periodically updated with the current parameters of \(Q^{\varvec{\mu }_{\omega _i}}\) to make training more stable. \({\mathcal {L}}(\omega _i)\) minimises the expectation of the difference between the current and the target action-state function.

In this formulation, as there is no interaction between agents, the policies are learned independently. We adopt the CLDE paradigm by letting the critics \(Q^{\varvec{\mu }_{\omega _i}}\) use the observations \( {\mathbf {x}} = (\varvec{o}_1, \varvec{o}_2, \dots , \varvec{o}_N)\) and the actions of all agents, hence:

$$\begin{aligned} \nabla _{\omega _i}J(\varvec{\mu }_{\omega _i}) = {\mathbb {E}}_{{\mathbf {x}}, a \sim {\mathcal {D}}} \Big [\nabla _{\omega _i}\varvec{\mu }_{\omega _i}(\varvec{o}_i) \nabla _{a_i}Q^{\varvec{\mu }_{\omega _i}}({\mathbf {x}},a_1, a_2, \dots , a_N)|_{a_i=\varvec{\mu }_{\omega _i}(\varvec{o}_i)} \Big ] \end{aligned}$$
(9)

where \( {\mathcal {D}} \) contains transitions in the form of \( ( {\mathbf {x}}, {\mathbf {x}}', a_1, a_2, \dots , a_N, r_1, \dots , r_N )\) and \( \mathbf {x'} = (\varvec{o}'_1, \varvec{o}'_2, \dots , \varvec{o}'_N) \) contains the next observations of all agents. Accordingly, \( Q^{\varvec{\mu }_{\omega _i}} \) is updated as

$$\begin{aligned} \begin{aligned} {\mathcal {L}}(\omega _i) = {}&{\mathbb {E}}_{{\mathbf {x}}, a, r, \mathbf {x'} \sim {\mathcal {D}}} \Big [(Q^{\varvec{\mu }_{\omega _i}}({\mathbf {x}}, a_1, a_2, \dots , a_N) - y)^2 \Big ], \\ y = {}&r_i + \gamma Q^{\varvec{\mu '}_{\omega _i}}(\mathbf {x'}, a'_1, a'_2, \dots , a'_N) \end{aligned} \end{aligned}$$
(10)

in which \( a'_1, a'_2, \dots , a'_N \) are the next actions of all agents. By minimising the loss in Eq. 10, the model attempts to improve the estimate of the critic \(Q^{\varvec{\mu }_{\omega _i}}\), which is in turn used to improve the policy itself through Eq. 9. Since the input of the policy described in Eq. 8 is \((\varvec{o}_i, {\mathbf {m}})\), the gradient of the resulting algorithm to maximize \(J(\theta _i) = {\mathbb {E}} [R_i]\) can be written as:

$$\begin{aligned} \nabla _{\theta _i}J(\varvec{\mu }_{\theta _i}) ={\mathbb {E}}_{{\mathbf {x}}, a, {\mathbf {m}} \sim {\mathcal {D}}} \Big [\nabla _{\theta _i}\varvec{\mu }_{\theta _i}(\varvec{o}_i, {\mathbf {m}}) \nabla _{a_i}Q^{\varvec{\mu }_{\theta _i}}({\mathbf {x}}, a_1, \dots , a_N)|_{a_i=\varvec{\mu }_{\theta _i}(\varvec{o}_i, {\mathbf {m}})} \Big ] \end{aligned}$$

where \( {\mathcal {D}} \) is a replay buffer which contains transitions in the form of \( ( {\mathbf {x}}, {\mathbf {x}}', a_1, \dots , a_N, {\mathbf {m}}, r_1, \dots , r_N )\). The \( Q^{\varvec{\mu }_{\theta _i}} \) function is updated according to Eq. 10. Algorithm 1 provides the pseudo-code of the resulting algorithm, which we call MD-MADDPG (Memory-driven MADDPG).
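For concreteness, a minimal sketch of one MD-MADDPG update for agent i is given below, following Eqs. 9–10. Target networks are assumed to exist but their soft updates are omitted, and feeding the stored memory \({\mathbf {m}}\) to the target actors is a simplifying assumption of this sketch rather than a detail taken from Algorithm 1.

```python
import torch
import torch.nn.functional as F

def md_maddpg_update(i, batch, actors, critics, target_actors, target_critics,
                     actor_opts, critic_opts, gamma=0.95):
    # batch: lists of per-agent tensors plus the shared memory stored in the replay buffer
    obs, next_obs, actions, mem, rewards = batch

    # Critic update (Eq. 10): regress Q towards the one-step target y.
    with torch.no_grad():
        next_actions = [target_actors[j](next_obs[j], mem)[0] for j in range(len(actors))]
        y = rewards[i] + gamma * target_critics[i](torch.cat(next_obs + next_actions, dim=-1))
    q = critics[i](torch.cat(obs + actions, dim=-1))
    critic_loss = F.mse_loss(q, y)
    critic_opts[i].zero_grad()
    critic_loss.backward()
    critic_opts[i].step()

    # Actor update (Eq. 9): ascend the critic's value of agent i's own action,
    # keeping the other agents' buffered actions fixed.
    a_i, _ = actors[i](obs[i], mem)
    joint = [a_i if j == i else actions[j] for j in range(len(actors))]
    actor_loss = -critics[i](torch.cat(obs + joint, dim=-1)).mean()
    actor_opts[i].zero_grad()
    actor_loss.backward()
    actor_opts[i].step()
```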

3.3 MD-MADDPG decentralised execution

During execution, only the learned actors \(\varvec{\mu }_{\theta _1},\varvec{\mu }_{\theta _2}, \dots , \varvec{\mu }_{\theta _N} \) are used to make decisions and select actions. An action is taken in turn by a single agent. The current agent receives its private observations, \(\varvec{o}_i\), reads \({\mathcal {M}}\) to extract \({\mathbf {r}}_i\) (Eq. 3), generates the new version of \({\mathbf {m}}\) (Eq. 5), stores it into \({\mathcal {M}}\) and selects its action \(a_i\) using \(\varvec{\mu }_{\theta _i}\). The policy of the next agent is then driven by the updated memory.
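A sketch of this decentralised execution loop is shown below; the environment API (`env.reset`, `env.step`) is illustrative, and the memory is assumed to be reset to zeros at the start of each episode.

```python
import torch

def run_episode(env, actors, mem_dim=200, horizon=100):
    obs = env.reset()                          # list of per-agent observation tensors (assumed)
    m = torch.zeros(1, mem_dim)                # shared memory at the start of an episode
    for _ in range(horizon):
        actions = []
        for i, actor in enumerate(actors):     # agents access the memory sequentially
            with torch.no_grad():
                a_i, m = actor(obs[i].unsqueeze(0), m)   # read, write, then act
            actions.append(a_i.squeeze(0))
        obs, rewards, done, _ = env.step(actions)
        if done:
            break
```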

Algorithm 1 MD-MADDPG (pseudo-code)

4 Experimental settings and results

4.1 Environments

In this section, we present a battery of six two-dimensional navigation environments (Fig. 2), with continuous space and discrete time. We introduce tasks of increasing complexity, requiring progressively more elaborate coordination skills: five environments are inspired by the Cooperative Navigation problem from the multi-agent particle environment (Lowe et al. 2017; Mordatch and Abbeel 2017), in addition to Waterworld from the SISL suite (Gupta et al. 2017). We focus on two-agent systems to keep the settings sufficiently simple and attempt an initial analysis and interpretation of emerging communication behaviours. A short description of the six environments follows.

Cooperative navigation (CN) This environment consists of N agents and N corresponding landmarks. An agent’s task is to occupy one of the landmarks whilst avoiding collisions with other agents. Every agent observes the distances to all other agents and the landmark positions.

Partial observable cooperative navigation (PO CN) This is based on Cooperative Navigation, i.e. the task and action space are the same, but the agents now have a limited vision range and can only observe a portion of the environment around them within a pre-defined radius.

Synchronous cooperative navigation (Sync CN) The agents need to occupy the landmarks exactly at the same time in order to be positively rewarded. A landmark is declared as occupied when an agent is arbitrarily close to it. Agents are penalised when the landmarks are not occupied at the same time.

Sequential cooperative navigation (Sequential CN) This environment is similar to the previous one, but the agents here need to occupy the landmarks sequentially and avoid reaching them simultaneously in order to be positively rewarded. Occupying the landmarks at the same time is penalised.

Swapping cooperative navigation (Swapping CN) In this case the task is more complex as it consists of two sub-tasks. Initially, the agents need to reach the landmarks and occupy them at the same time. Then, they need to swap their landmarks and repeat the same process.

Waterworld In this environment, two agents with limited range vision have to collaboratively capture food targets whilst avoiding poison targets. A food target can be captured only if both agents reach it at the same time. Additional details are reported in Gupta et al. (2017).

Fig. 2

An illustration of our environments. Blue circles represent the agents; dashed lines indicate the range of vision; green and red circles represent the food and poison targets, respectively, while black dots represent landmarks to be reached (Color figure online)

4.2 Implementation details

In all our experiments, we use a neural network with one layer (512 units) for the encoding (Eq. 1), a neural network with one layer (256 units) for the action selector (Eq. 7) and neural networks with three hidden layers (1024, 512, 256 units, respectively) for the critics. For MADDPG and MA-MADDPG, the actors are implemented with neural networks with two hidden layers (512, 256 units). The size of \({\mathbf {m}}\) is fixed to 200; this value has been empirically found to be optimal given the network architectures (Section B.7 provides a validation study on the choice of memory size). Consequently, the size of \({\mathbf {h}}_i\) and \({\mathbf {e}}_i\) is set to 200. We use the Adam optimizer (Kingma and Ba 2014) with a learning rate of \( 10^{-3} \) for the critics and \( 10^{-4} \) for the policies. The reward discount factor is set to 0.95, the size of the replay buffer to \(10^{6}\) and the batch size to 1024. The number of time steps per episode is set to 1000 for Waterworld and 100 for the other environments. We update the network parameters after every 100 samples added to the replay buffer, using soft updates with \( \tau = 0.01\). We train all the models over 60,000 episodes of 100 time-steps each on all the environments, except for Waterworld, for which we use 20,000 episodes of 1000 time-steps each. For exploration we use the Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein 1930) with \( \theta = 0.15 \) and \( \sigma = 0.3 \), a stochastic process which, over time, tends to drift towards its mean. It is commonly employed within DDPG (Lillicrap et al. 2015) to introduce temporally correlated noise, thereby avoiding the averaging effect of uncorrelated random signals, which would lead to less effective exploration. Discrete actions are supported by the Gumbel-Softmax, a biased, low-variance gradient estimator (Jang et al. 2016) typically used within the back-propagation algorithm in the presence of categorical variables. We use Python 3.5.4 (Van Rossum and Drake Jr 1995) with PyTorch v0.3.0 (Paszke et al. 2017) as our automatic differentiation and machine learning framework. All the computations were performed on an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz and a GeForce GTX TITAN X GPU.
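For reference, the hyperparameters listed above can be collected into a single configuration sketch; the values are those reported in this section, while the dictionary keys are illustrative.

```python
MD_MADDPG_CONFIG = {
    "encoder_units": 512,                 # Eq. 1
    "action_selector_units": 256,         # Eq. 7
    "critic_units": (1024, 512, 256),
    "baseline_actor_units": (512, 256),   # MADDPG / MA-MADDPG actors
    "memory_size_M": 200,                 # also the size of e_i and h_i
    "lr_critic": 1e-3,
    "lr_actor": 1e-4,
    "gamma": 0.95,
    "replay_buffer_size": int(1e6),
    "batch_size": 1024,
    "steps_per_episode": {"waterworld": 1000, "default": 100},
    "train_episodes": {"waterworld": 20000, "default": 60000},
    "soft_update_tau": 0.01,
    "ou_noise": {"theta": 0.15, "sigma": 0.3},
}
```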

4.3 Experimental results

In our experiments, we compared the proposed MD-MADDPG against four algorithms: MADDPG (Lowe et al. 2017), Meta-agent MADDPG (MA-MADDPG), CommNet (Sukhbaatar et al. 2016) and MAAC (Iqbal and Sha 2019). MA-MADDPG is a variation of MADDPG in which the policy of an agent, during both training and execution, is conditioned upon the observations of all the other agents in order to overcome difficulties due to partial observability. These methods have been selected to provide fair comparisons since they offer different learning approaches to multi-agent problems. MADDPG is what our method builds on, so this comparison can quantify the improvements brought by the proposed communication mechanism; MA-MADDPG offers an alternative information sharing mechanism; CommNet implements an explicit form of communication; MAAC is a recent state-of-the-art method in which critics select the information to share through an attention mechanism. We analyse the performance of these competing learning algorithms on all the six environments described in Sect. 4.1. In each case, after training, we evaluate an algorithm’s performance by collecting samples from an additional 1000 episodes, which are then used to extract different performance metrics: the reward quantifies how well a task has been solved; the distance from landmarks captures how closely an agent has reached the landmarks; the number of collisions counts how many times an agent has failed to avoid collisions with others; sync occupations counts how many times the landmarks have been occupied simultaneously and, analogously, not sync occupations counts how many times only one of the two landmarks has been occupied. For Waterworld, we also count the number of food targets and the number of poison targets. Since this environment requires continuous actions, we cannot use MAAC, as this method only operates on discrete action spaces. In Table 1, for each metric, we report the sample average and standard deviation obtained by each algorithm on each environment. A visualization of all results through box plots can be found in Section B.5 of the Supplementary Material.

Table 1 Comparison of MADDPG, MA-MADDPG, CommNet, MAAC and MD-MADDPG on six environments ordered by increasing level of difficulty, from CN to Waterworld. The sample mean and standard deviation over 1000 episodes are reported for each metric

All algorithms perform very similarly in the Cooperative Navigation and Partial Observable Navigation cases. This result is expected because these environments involve relatively simple tasks that can be completed even without explicit message-passing and information sharing functionalities. Despite communication not being essential, MD-MADDPG reaches performance comparable to MADDPG and MA-MADDPG. In the Synchronous Cooperative Navigation case, the ability of MA-MADDPG to overcome partial observability issues by sharing the observations across agents seems to be crucial, as the total rewards achieved by this algorithm are substantially higher than those obtained by both MADDPG and MD-MADDPG. In this case, whilst not achieving the highest reward, MD-MADDPG keeps the number of unsynchronised occupations at the lowest level, and also performs better than MADDPG on all three metrics. It would appear that in this case pooling all the private observations together is sufficient for the agents to synchronize their paths leading to the landmarks.

When moving on to more complex tasks requiring further coordination, the performances of the three algorithms diverge further in favour of MD-MADDPG. The requirement for strong collaborative behaviour is more evident in the Sequential Cooperative Navigation problem, as the agents need to explicitly learn to take either shorter or longer paths from their initial positions to the landmarks in order to occupy them in sequential order. Furthermore, according to the results in Table 1, the average distance travelled by the agents trained with MD-MADDPG is less than half the distance travelled by agents trained with MADDPG, indicating that these agents were able to find a better strategy by developing an appropriate communication protocol. Similarly, in the Swapping Cooperative Navigation scenario, MD-MADDPG achieves superior performance, and is again able to discover solutions involving the shortest paths. Waterworld is significantly more challenging as it requires a sustained level of synchronization throughout the entire episode and can be seen as a sequence of sub-tasks whereby each time the agents must reach a new food target whilst avoiding poison targets. In Table 1, it can be noticed that MD-MADDPG significantly outperforms both competitors in this case. The importance of sharing observations with other agents can also be seen here, as MA-MADDPG generates good policies that avoid poison targets, yet its average reward is substantially lower than the one scored by MD-MADDPG. The experimental settings so far have involved two agents. In addition, we have also investigated settings with a higher number of agents; see the Supplementary Material (Section B.2 for Cooperative Navigation and Section B.3 for Partial Observable Cooperative Navigation). These results show that the proposed method can be successfully used on larger systems without incurring any numerical complications or convergence difficulties. When compared to the other algorithms, MD-MADDPG achieves superior performance on Cooperative Navigation with respect to the reward metric. On Partially Observable Cooperative Navigation, there is no definite winner; nevertheless, MD-MADDPG shows competitive performance, for example outperforming all the baselines in the five-agent scenario.

In Section B.4 of the Supplementary Material, we provide an ablation study showing that the main components of MD-MADDPG are needed for its correct behaviour. We investigate the effects of removing each of the key components, i.e. the context vector and the read and write modules. Removing the context vector reduces the quality of the performance obtained on CN and on environments which require greater coordination efforts, like Sequential CN, Swapping CN and Waterworld. On PO CN no significant differences in performance are observed, while on Synchronous CN not sync occupations worsen (by approximately a factor of five) and sync occupations improve (by approximately a factor of two). This result is explained by the fact that in Sync CN, good strategies that do not involve explicit communication can be learnt to achieve good performance on sync occupations. The best performing method overall on this scenario is MA-MADDPG (see Table 1). This comparative method implements an implicit form of communication, equivalent to simple information sharing, which can be very effective in overcoming the partial observability issue that is the main challenge in Sync CN. We have observed that without the writing or reading components the performance worsened on all the experiments we ran.

Fig. 3

Visualisation of communications strategies learned by the agents in four different environments: the three principal components provide orthogonal descriptors of the memory content written by the agents and are being plotted as a function of time. Within each component, the highest values are in red, and the lowest values are in blue. The bar at the bottom of each figure indicates which phase (or sub-task) was being executed within an episode; see Sect. 4.4 for further details. The memory usage patterns learned by the agents are correlated with the underlying phases and the memory is no longer utilised once a task is about to be completed (Color figure online)

4.4 Communication analysis

In this section, we explore the dynamic patterns of communication activity that emerged in the environments presented in the previous section, and look at how the agents use the shared memory throughout an episode while solving the required task. For each environment, after training, we executed episodes with time horizon T and stored the write vector \(\mathbf {m'}\) of each agent at every time step t. Exploring how \(\mathbf {m'}\) evolves within an episode can shed some light onto the role of the memory device at each phase of the task. The analysis presented in this section focuses on the write vector, as we expect it to be more strongly correlated with the environment dynamics than the other components. The content of the write vector corresponds to the content of the communication channel itself, and is expected to contain information related to the task (e.g. changes in the current environment, the agent’s strategy or observed points of interest). A communication analysis with respect to the read vector \({\mathbf {r}}_i\) is presented in the Supplementary Material (Section B.8). The content of the read vector is an implicit representation internal to the agent itself, which serves to interpret the content of the channel and, at the same time, is utilised in the generation of \(\mathbf {m'}\). In order to produce meaningful visualisations, we first projected the dimensions of \(\mathbf {m'}\) onto the directions maximising the sample variance (i.e. the variance of the observed \(\mathbf {m'}\) across simulated episodes) using linear PCA.
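A minimal sketch of this projection step is given below, using scikit-learn's PCA as one possible implementation (the paper does not specify the library); the write vectors of one agent are assumed to have been logged as a (timesteps × M) array.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_write_vectors(writes: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project logged write vectors m' onto their first principal components."""
    pcs = PCA(n_components=n_components).fit_transform(writes)   # (timesteps, n_components)
    # Rescale each component to [0, 1] before plotting it on the colour map of Fig. 3.
    pcs = (pcs - pcs.min(axis=0)) / (pcs.max(axis=0) - pcs.min(axis=0) + 1e-8)
    return pcs
```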

Figure 3 shows the principal components (PCs) associated with the two agents over time for four of our six simulation environments. Only the first three PCs were retained, as these were found to cumulatively explain over \(80\%\) of the variance in all cases. The values of each PC were standardised to lie in [0, 1], so that they share the same range for fair comparison, and are plotted on a color map: one is in red and zero in blue. The timeline at the bottom of each figure indicates which specific phase of an episode is being executed at any given time point, and each consecutive phase is coloured using a different shade of grey. For instance, in Sequential Cooperative Navigation, a single landmark is reached and occupied in each phase. In Swapping Cooperative Navigation, during the first phase the agents search and find the landmarks; in the second phase they swap targets, and in the third phase they complete the task by reaching the landmarks again. In Synchronous Cooperative Navigation the phase indicates whether none of the landmarks is occupied (light-grey), just one is occupied (dark-grey) or both are occupied (black). Usually, in the last phase, the agents learn to stay close to their targets. This analysis pointed out that in the final phases, when tasks are already completed and there is no need for coordination, the PCs representing the communication activities assume lower (blue) values, while during the earlier phases, when tasks are still to be solved and cooperation is more strongly required, they assume higher (red) values. This led us to interpret the higher values as being indicative of high memory usage, and lower values as being associated with low activity. In most cases, high communication activity is maintained while the agents are actively working on and completing a task, whereas during the final phases (where typically there is no exploration because the task is considered completed) low activity levels are more predominant.

This analysis also highlights the fact that the communication channel is used differently in each environment. In some cases, the levels of activity alternate between agents. For instance, in Sequential Cooperative Navigation (Fig. 3a), high levels of memory usage by one agent are associated with low ones by the other. A different behaviour is observed in the other environments: in the Swapping Cooperative Navigation task both agents produce either high or low activation values at the same time, whereas in Synchronous Cooperative Navigation the memory activity is very intense before phase three, while the agents are collaborating to complete the task. The dynamics characterizing the memory usage also change based on the particular phase reached within an episode. For example, in Fig. 3a, during the first two phases the agents typically show alternating activity levels, whilst in the third phase both agents significantly decrease their memory activity as the task has already been solved and there are no more changes in the environment. Figure 3 provides some evidence that, in some cases, a peer-to-peer communication strategy is likely to emerge instead of a master-slave one where one agent takes complete control of the shared channel. The scenario is significantly more complex in Waterworld, where the changes in memory usage appear at a much higher frequency due to the presence of many sequential sub-tasks. Here, each light-grey phase indicates that a food target has been captured. Peaks of memory activity seem to follow those events as the agents reassess their situation and require higher coordination to jointly decide what the next target is going to be. In the Supplementary Material (B.1) we provide further experimental results showing the importance of the communication by corrupting the memory content at execution time, which further corroborate the role of the exchanged messages in improving agents’ coordination.

5 Conclusions

In this work, we have introduced MD-MADDPG, a multi-agent reinforcement learning framework that uses a shared memory device as an inter-agent communication channel to improve coordination skills. The memory content contains a learned representation of the environment that is used to better inform the individual policies. The memory device is learnable end-to-end without particular constraints other than its size, and each agent develops the ability to modify and interpret it. We empirically demonstrated that this approach leads to better performance in small-scale (up to six agents in our experiments) cooperative tasks where coordination and synchronization are crucial to successful completion of the task and where world visibility is very limited. Furthermore, we have visualised and analysed the dynamics of the communication patterns that have emerged in several environments. This exploration has indicated that, as expected, the agents have learned different communication protocols depending upon the complexity of the task. In this study we have mostly focused on two-agent systems to keep the settings sufficiently simple to understand the role of the memory. Very competitive results have been obtained when more agents are used.

In future work, we plan to study the role played by the sequential order in which the memory is updated as the number of agents grows. A possible approach may consist of deploying agent selection mechanisms, possibly based on attention, so that only a relevant subset of agents can modify the memory at any given time, or of imposing master-slave architectures. A possible solution would be to have an agent acting as a “scheduler” that controls the access to the memory, decides which information can be shared and provides scheduling for the write accesses. Introducing such a scheduling agent would allow the current framework to be kept unaltered, e.g. the sequential access to the memory would be retained. Although the scheduling agent would add an additional layer of complexity, it might reduce the number of memory accesses required in larger scale systems and improve the overall scalability. In future work, we will also apply MD-MADDPG to environments characterized by more structured and high-dimensional observations (e.g. pixel data), where collectively learning to represent the environment through a shared memory should be particularly beneficial.