Introduction

Owing to their capacity for solving single-agent sequential decision-making tasks, reinforcement learning (RL) methods such as Q-learning [47] have attracted increasing research interest in recent years. In the single-agent setting, the problem can be mathematically modeled as a Markov decision process (MDP) [36]. When the setting extends to multiagent systems (MAS), multiagent RL (MARL) algorithms have been developed. In MARL, the MDP is extended to an environment where only the joint action of all the agents fully determines the state transition [33]. MARL is therefore a more complex problem, and its learning speed is often unsatisfactory [31].

Knowledge transfer has been proven to be a useful tool for accelerating MARL [8, 41]. One popular approach is action advising [1, 25, 35], which enhances the performance of learning agents (advisees) by drawing on the policies of expert agents (advisors) [51]. Chernova and Veloso [7] proposed a confidence-based knowledge transfer method. They designed a distance threshold to describe the agent’s familiarity with the current state and a confidence threshold to estimate the expected performance in that state. If the two thresholds are not both satisfied, knowledge transfer is triggered, i.e., the learning agent asks for action advice from a predefined teacher. Further considering the communication cost of the advising process, Torrey and Taylor [44] proposed an action advising method named Teaching on a Budget. The authors designed several criteria to evaluate when it is appropriate to offer action advice, and with their help, the agent’s performance was enhanced with a limited amount of advice from the teacher. In these cases, action advising showed satisfactory performance with fixed advisor and advisee roles, which must be predefined before learning by integrating pre-trained agents with expert knowledge [34, 43] or human demonstrations [11, 42].

However, experts or human demonstrations are not always available, for example, in multiagent path planning [13] and smart grid problems [40]. In such cases, the tasks can be entirely different, so experts from other tasks may not perform well, and it is often very hard or expensive for humans to find an optimal solution to demonstrate. In other words, it is infeasible to predefine the roles of advisors and advisees [25], and the agents have to learn from scratch. Silva et al. [32] proposed a simultaneously learning and advising method to achieve action advising based on dynamic role assignment. The method allows agents to switch between the advisor and advisee roles, dynamically determining whether an agent should ask for or give advice according to trigger conditions computed from Q values and state visit counts (i.e., how many times an agent has encountered a specific state). Hou et al. [14] modeled action advising-based approaches as memetic processes and designed a Q value-based metric to evaluate when to provide action advice. Because no visit counts need to be maintained for specific states, this kind of metric can be adopted in problems with large state spaces. To avoid the biases introduced by handcrafted trigger conditions, Omidshafiei et al. [25] proposed LeCTR, which introduces two additional networks for each agent to learn when to request and when to provide advice, respectively. Since the advice selection problem remains unsettled, LeCTR can only be applied to pairwise scenarios, i.e., environments with exactly two agents.

Although remarkable progress has been made, there are two inherent flaws in directly adopting action advising-based methods in learning-from-scratch settings: (1) Limited scalability. For algorithms aiming to solve multiagent problems, it is crucial to remain effective and practical as the number of agents grows [16, 27]. In action advising-based methods, knowledge transfer is conducted in an inquiry-answer manner, which means that an advisor (predefined or dynamically determined) has to put itself in the advisees’ places and compute solutions for them at an extra computational cost. In learning-from-scratch scenarios, this load is magnified because every agent can be an advisor, so the computational load rises quadratically with the number of agents. This hinders the scalability of the system; a more detailed analysis can be found in Sect. 3.4. (2) Nonstationarity. When there is more than one agent in a shared environment, the behavior of each agent affects the others’ observations of the environment. In a learning-from-scratch MARL scenario, the behavioral policies of the agents are constantly changing due to concurrent learning, so the reaction pattern of the environment appears nondeterministic, i.e., nonstationary, from the view of each agent [24, 49]. This is a common and crucial issue for MARL with multiple learning agents [23]: it changes the learning target of the agents and thus prevents them from converging to a fixed optimum [12]. There are several ways to address nonstationarity. A straightforward one is to disable experience replay [9, 14, 25], but this sacrifices data efficiency. Another is to collect the observations of all the peers as the policy input [21]; however, this harms scalability, because the dimension of the state space rises sharply with the number of agents. It is therefore both challenging and necessary to develop a knowledge transfer approach that accounts for both scalability and stationarity.

To this end, we propose a Stationary and Scalable knowledge transfer approach based on Experience Sharing (S\(^2\)ES) to enhance the learning speed of MARL with learning-from-scratch agents. First, to improve scalability, we design an augmented form of experience as the expression of knowledge. By actively sharing this augmented experience from one agent to its peers, the inquiry-answer manner can be avoided, so the computational load of the learning system grows only linearly with the system scale. Second, a synchronized learning scheme is designed to reduce nonstationarity. It aims to keep the learning trajectories of the agents identical for stationarity, while the multiagent nature allows it to retain the merits of experience replay. More importantly, the input of an agent’s policy remains its local observation, avoiding a dimension explosion of the state space when there are many peers. Finally, we design two metrics to determine when the experience should be transferred.

The main contributions of this paper are as follows:

1. A novel augmented experience-based experience sharing scheme is proposed for the learning-from-scratch MARL problem, which shows good scalability and stationarity.

2. We propose a new principle for designing trigger conditions for S\(^2\)ES and design two conditions based on this principle to determine when to share the experience.

3. Empirical studies with ablations are provided, which not only validate the proposed algorithm but also clarify the contribution of each component.

The remainder of this paper is organized as follows. In Sect. 2, we present some background concepts and algorithms. Sect. 3 details the proposed S\(^2\)ES, followed by a numerical analysis of its scalability. Empirical studies and ablations are provided and discussed in Sect. 4, and concluding remarks and future work are outlined in Sect. 5.

Preliminaries

In this section, we introduce the relevant mathematical background of RL and MARL.

MDP and SG

MDPs [2] are often used to describe sequential decision-making and can be seen as a formalization of single-agent RL problems [36]. An MDP is defined by a tuple \( \langle {\mathcal {S}}, {\mathcal {A}}, {\mathcal {T}}, {\mathcal {R}}, \gamma \rangle \), where \({\mathcal {S}}\) and \({\mathcal {A}}\) are the state space and action space, respectively. \({\mathcal {T}}(s, a, s'): {\mathcal {S}} \times {\mathcal {A}} \times {\mathcal {S}} \rightarrow [0, 1]\) is the transition probability function, denoting the probability of the state transition from s to \(s'\) given action a. \({\mathcal {R}}: {\mathcal {S}} \times {\mathcal {A}} \rightarrow {\mathbb {R}}\) is the reward function and \(\gamma \in [0, 1)\) is a discount factor. Usually, the agent in an MDP aims to find an optimal policy \(\pi ^*\) that maximizes the state value function at \(t=0\). The state value function is given by

$$\begin{aligned} v_\pi (s)={\mathbb {E}}_\pi \left[ \sum _{k=0}^{\infty }\gamma ^kr_{t+k+1}|s_t=s\right] , \end{aligned}$$
(1)

where \(\pi : {\mathcal {S}} \times {\mathcal {A}} \rightarrow [0, 1]\) is the policy that the agent follows, \(v_\pi (s)\) represents the value of state s under policy \(\pi \), \({\mathbb {E}}_\pi [\cdot ]\) denotes the expectation under policy \(\pi \), and \(r_t\) denotes the reward at time t. To evaluate the quality of an action choice in state s, the state-action value function (also known as the Q value) is defined as

$$\begin{aligned} Q_\pi (s, a)={\mathbb {E}}_\pi \left[ \sum _{k=0}^{\infty }\gamma ^kr_{t+k+1}|s_t=s, a_t = a\right] . \end{aligned}$$
(2)

When the system is extended to an MAS setting, it can be modeled as a Stochastic Game (SG) [3, 30] (or Markov Game [20]). Assuming the agents can only perceive local state observations rather than the full state of the environment, an SG can be formalized as a tuple \(G = \langle n, {\mathcal {S}}, {\mathcal {U}}, {\mathcal {T}}, {\mathcal {O}}, O, {\mathcal {R}}, \gamma \rangle \) [9], in which n denotes the number of agents. \({\mathcal {U}} = {\mathcal {U}}_1 \times \cdots \times {\mathcal {U}}_n\) is the joint action space; for agent i, we have its own action \(u_i \in {\mathcal {U}}_i\) and the joint action \({\varvec{u}} \in {\mathcal {U}}\). \({\mathcal {T}}(s, {\varvec{u}}, s'): {\mathcal {S}} \times {\mathcal {U}} \times {\mathcal {S}} \rightarrow [0,1]\) is the transition probability given a state and a joint action. \({\mathcal {O}} = {\mathcal {O}}_1 \times \cdots \times {\mathcal {O}}_n\) is the joint observation space and O is the observation function; agent i obtains an observation \(o_i = O(s, u_i)\), where \(O: {\mathcal {S}} \times {\mathcal {U}}_i \rightarrow {\mathcal {O}}_i\). \({\mathcal {R}} = {\mathcal {R}}_1 \times \cdots \times {\mathcal {R}}_n\) is the joint reward space, in which \({\mathcal {R}}_i\) is the reward function of agent i.

Q-learning and IQL

Q-learning [47] is a classic RL algorithm for MDPs, which iteratively updates the Q-value according to

$$\begin{aligned} Q(s, a) \leftarrow Q(s, a)+\alpha \left( y-Q\left( s, a\right) \right) , \end{aligned}$$
(3)

in which \(y = r+\gamma \max _{a^{\prime }} Q\left( s^{\prime }, a^{\prime }\right) \) is the target value and \(\delta \doteq y-Q\left( s, a\right) \) is the TD error. Employing neural networks to estimate the Q value, Mnih et al. [22] developed the deep Q-network (DQN). To train a network parameterized by the vector \(\varvec{\theta }\) toward better estimates, the following loss function is minimized:

$$\begin{aligned} {\mathcal {L}}\left( \varvec{\theta }\right) ={\mathbb {E}}_{s, a, r, s^{\prime } \sim {\mathcal {D}}}\left[ \left( {y}^--Q\left( s, a ; \varvec{\theta }\right) \right) ^{2}\right] , \end{aligned}$$
(4)

in which \({\mathcal {D}}\) denotes the experience replay buffer, which is adopted to break the correlations of the observation sequence. \({y}^- = r+\gamma \max _{a^{\prime }} Q\left( s^{\prime }, a^{\prime }; \varvec{\theta }^- \right) \) is the target value given by a periodically updated target network with parameter vector \(\varvec{\theta }^-\), aiming to remove the correlations between Q-value and target value. The experience replay and target network together improve the learning stability.
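
For concreteness, the DQN update of Eq. 4 can be sketched in a few lines of PyTorch. This is a minimal illustration, not the implementation used in this paper; the observation dimension, network width, buffer size, and batch size below are placeholder values.

```python
import random
from collections import deque

import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 8, 5, 0.95    # placeholder dimensions and discount

q_net = nn.Sequential(nn.Linear(obs_dim, 36), nn.ReLU(), nn.Linear(36, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 36), nn.ReLU(), nn.Linear(36, n_actions))
target_net.load_state_dict(q_net.state_dict())   # theta^-: a periodically refreshed copy

replay = deque(maxlen=10_000)                     # replay buffer D of <s, a, r, s'> tuples

def dqn_loss(batch):
    """Squared error of Eq. (4) between Q(s, a; theta) and the target y^-."""
    s, a, r, s_next = map(torch.as_tensor, zip(*batch))
    s, r, s_next = s.float(), r.float(), s_next.float()
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                         # y^- comes from the frozen target network
        y = r + gamma * target_net(s_next).max(dim=1).values
    return ((y - q_sa) ** 2).mean()

def train_step(optimizer, batch_size=32):
    """Sample uniformly from D and take one gradient step on the loss."""
    loss = dqn_loss(random.sample(replay, batch_size))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```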

When trying to solve multiagent tasks, one straightforward but effective way is to extend Q-learning to independent Q-learning (IQL) [39]. In IQL, each agent learns an independent policy according to its own observations and actions. Incorporating neural networks and experience replay, Tampuu et al. [37] introduced DQN to IQL. Since they do not require any centralized computation or global information, such independent learning approaches are well received for their scalability [9]. However, the adoption of experience replay brings outdated knowledge to the learning agents, making the learning process nonstationary [24]. In this paper, we aim to effectively accelerate the learning process of independent learners while reducing this nonstationarity.

Since the main contribution of this paper lies in the knowledge transfer procedure, in the remainder of this paper, we take the classic independent DQN as the base algorithm to detail the derivation and implementation of S\(^2\)ES.

S\(^2\)ES

In this section, we introduce the general framework of our proposed S\(^2\)ES, detail how and why the expected goals are achieved by the proposed schemes, and analyze the scalability of S\(^2\)ES numerically. The framework of S\(^2\)ES consists of three main parts: what kind of experience, how to learn, and when to transfer.

What kind of experience

Fig. 1 Demonstration of the different computational load of sharing regular experience and the augmented experience. Purple boxes mark the differences between Fig. 1a, b. \(\textcircled {1}\) Experience generation: we have to calculate the target value y to generate the augmented experience, while for the regular experience, \(\langle o, u, r, o^\prime \rangle \) is enough. \(\textcircled {2}\) Loss calculation: when using the regular experience to calculate the loss, an agent needs to compute y together with the Q value; but for the augmented experience, the target values can be obtained directly from the experience without calculation

In the MARL domain, the effectiveness of the experience sharing scheme has been demonstrated in the literature. Commonly, experience refers to the tuple \(\langle o^i, u^i, r^i, {o}^{\prime i} \rangle \), in which \(o^i\) and \(u^i\) represent the state observation and action of agent i, respectively, \(r^i\) is the resultant reward, and \({o}^{\prime i}\) is the state observation at the next time step [52]; the time index is omitted for brevity. Wang et al. [45] introduced the experience sharing scheme into the dynamic service composition domain: a centralized supervisor collects the state transitions of the agents and offers action advice to the agents at each step. Similarly, Yasuda and Ohkura [50] designed a centralized controller, trained by collecting the experiences of the robots, to learn how to move a swarm of robots. In contrast to the methods above, we focus on a kind of MAS in which the agents learn from scratch without centralized memories or controllers.

In S\(^2\)ES, we first design the “experience” using an augmented form of transition. To enhance learning stability, a periodically updated target network is usually adopted to help calculate the target values \(y^i\), breaking the correlations between the action values \(Q^i\) and the target values \(y^i\) [22]. In this work, we instead take the networks of the peers as the target networks and design an augmented form of experience: the target value \(y^i\), rather than its ingredients \(r^i\) and \(o^{\prime i}\), is added to the experience. Fed with the shared (transmitted) augmented experience from agent j, the parameters \(\varvec{\theta }^i\) of agent i’s value network are updated according to the loss given by

$$\begin{aligned} {\mathcal {L}}\left( \varvec{\theta }^i\right) ={\mathbb {E}}_{o, u, y^{j} \sim {\mathcal {D}}}\left[ \left( {y}^j-Q^i\left( o, u; \varvec{\theta }^i\right) \right) ^{2}\right] . \end{aligned}$$
(5)

Clearly, the incorporation of \(y^j\) in Eq. 5 is expected to eliminate the correlation between the target value and the Q value without maintaining a target network for each agent, which should enhance the stability of the algorithm.

Another good feature of sharing the augmented experience is computational efficiency. Since the target values are pre-calculated by the transmitting agents, the receivers need not repeat this computation when reusing the shared experience to train their networks. Compared to sharing the regular experience \(\left\langle o, u, r, o^{\prime } \right\rangle \), this reduces the computational load. Figure 1 depicts the difference in detail. In each epoch, an agent selects an action according to its own observation and policy and gets feedback from the environment. At this point, the regular experience, i.e., the state transition \(\left\langle o, u, r, o^{\prime } \right\rangle \), can already be obtained and shared (see Fig. 1a), whereas the augmented experience can only be shared after the target value has been computed (see Fig. 1b). For agents that have received experience from their peers, if the experience is in the regular transition form, they have to calculate the target value of each piece of experience, together with the Q value, to compute the loss; if the augmented experience is transmitted, the target values are obtained directly from the experience, so only the Q values need to be computed. This explains why sharing the augmented experience reduces the computational burden.
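
To make the difference concrete, the following sketch contrasts the sender side and the receiver side. It is an illustration under our own naming (make_augmented_experience and receiver_loss are not the authors' identifiers), assuming each agent's value network maps an observation tensor to a vector of Q values.

```python
import torch

def make_augmented_experience(o, u, r, o_next, q_net_j, gamma=0.95):
    """Sender j computes the target value y^j once (step 1 in Fig. 1b)."""
    with torch.no_grad():
        y = r + gamma * q_net_j(o_next).max().item()
    return (o, u, y)            # augmented experience <o, u, y>; r and o' are not transmitted

def receiver_loss(q_net_i, shared_batch):
    """Loss of Eq. (5): receiver i reuses y^j directly, with no target recomputation."""
    o, u, y = zip(*shared_batch)
    o = torch.stack([torch.as_tensor(x, dtype=torch.float32) for x in o])
    u = torch.as_tensor(u, dtype=torch.int64)
    y = torch.as_tensor(y, dtype=torch.float32)
    q_ou = q_net_i(o).gather(1, u.unsqueeze(1)).squeeze(1)
    return ((y - q_ou) ** 2).mean()
```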

At the same time, since S\(^2\)ES shares experience actively without considering the status of the peers, its computational load is naturally lower than that of action advising-based knowledge transfer, in which the advisors have to undertake extra computations for the advisees. More importantly, compared to the action advising-based methods, the active sharing mechanism of S\(^2\)ES provides better scalability with respect to the number of learning agents. A more detailed analysis and comparison of the scalability is given in Sect. 3.4.

How to learn

Having determined what kind of experience to share, we now discuss how the received experience should be used in the learning process.

It is natural to simply adopt the experience replay used in single-agent deep RL, saving and sampling experience in a replay buffer, which is expected to improve data efficiency and stability. However, this brings nonstationarity, because the sampled experience may have been generated by outdated policies [9, 26].

Nonstationarity is a key challenge for independent learners [19]. Since independent learners treat the state of their peers as part of the environment, when the policies of the peer agents change, the state transition probability changes from an agent's local perspective, which brings about nonstationarity.

Formally, according to [19] and [24], the decision-making process of agent i in an MAS is stationary, iff, for any time \(k, l \in {\mathbb {N}}\), given \(o, o^{\prime } \in {\mathcal {O}}_i\) and \({\varvec{u}} \in {\mathcal {U}}\)

$$\begin{aligned} \begin{aligned}&P\left[ {o}_{k+1}=o^{\prime } \mid {\varvec{u}}_{k}=\left\langle u^i, {\varvec{u}}^{-i}_k \right\rangle , {o}_{k}=o\right] \\&\quad = P\left[ {o}_{l+1}=o^{\prime } \mid {\varvec{u}}_{l}=\left\langle u^i, {\varvec{u}}^{-i}_l \right\rangle , {o}_{l}=o\right] , \end{aligned} \end{aligned}$$
(6)

in which \({\varvec{u}}^{-i}={\varvec{u}} \backslash \left\{ u^{i} \right\} \), i.e., the joint action of all the agents except agent i.

One possible solution is to disable the experience replay [10], avoiding the effects of the outdated experience. This brings a difficult choice between using experience replay for better stability and data efficiency and disabling it for better stationarity.

Here, we propose a synchronized learning scheme that aims to avoid the nonstationarity brought by old experience while retaining stability and sample efficiency to some extent. Specifically, each agent in S\(^2\)ES maintains an independent replay buffer, \(\hat{{\mathcal {D}}}^i\) for agent i, which only stores the self-generated and received experience of the current time step. At every time step, agent i updates \(\varvec{\theta }^i\) according to Eq. 5 using all the experience in the buffer, i.e., the batch size equals the maximum size of the replay buffer. At the beginning of each time step, the replay buffer is cleaned up for the upcoming new experience.

With the help of the designed synchronized learning, the agents no longer use experience generated by old policies, and thus the environment should be stationary from each agent's local view [9]. On the other hand, although the sampling efficiency is reduced because the outdated experience is discarded, the shared experience can make up for this loss to some extent. Meanwhile, with S\(^2\)ES, the instability caused by the sequential correlation of the experience no longer exists, because the experiences in \(\hat{{\mathcal {D}}}^i\) are generated from the different decision-making processes of different agents.
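
A per-step buffer of this kind is easy to express; the small container below is an illustrative sketch of our own (not the authors' code), flushed at the start of every time step and consumed in full as a single batch.

```python
class SynchronizedBuffer:
    """Buffer D^i holding only the current step's self-generated and received experience."""

    def __init__(self):
        self._items = []

    def begin_step(self):
        # Discard everything from the previous step, so experience generated by
        # outdated policies is never replayed.
        self._items.clear()

    def add(self, experience):
        self._items.append(experience)

    def batch(self):
        # The batch size equals the buffer size: every stored item is used exactly once.
        return list(self._items)
```

In a full implementation, agent i would call begin_step() at the start of the step, append its own augmented experience plus whatever the peers share, and then pass batch() to the update of Eq. 5.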

When to transfer

There exist many MARL scenarios in which the system is sensitive to the communication cost of the learning process, such as real-world training or learning systems deployed on computer clusters. In these cases, too much communication among agents during knowledge transfer will instead lower the learning speed [44]. Hence, we further modify S\(^2\)ES to reduce the communication cost.

As detailed above, the action advising-based methods require two-way communication: the advisees first broadcast their own observations, and the advisors then send advice back. In contrast, S\(^2\)ES accelerates the learning process by actively sharing experience, which can be achieved in a one-way manner, so the communication of S\(^2\)ES is naturally reduced by half. Meanwhile, compared to the regular experience \(\left\langle o, u, r, o^{\prime } \right\rangle \), the augmented experience also has the advantage of a smaller packet size.

Here, we further reduce communication by introducing two metrics that trigger experience sharing only when necessary. In general, current works focus on two types of metrics: state visit count-based [32] and Q value-based [14, 44]. The state visit count-based methods assume that an agent is more skilled at handling states that have been visited more often. This requires each agent to keep a count for every possible state, which may be infeasible when the state space is extremely large. The Q value-based approaches are not affected by the size of the state space, but before the policies of the agents have converged, the Q values are noisy, so Q-based metrics are groundless in the early learning stage [1, 53].

However, considering the exploration–exploitation trade-off in the RL process, we believe that the early learning stage is exactly the period in which knowledge transfer should be encouraged for better exploration, similar in spirit to the annealing process in \(\varepsilon \)-greedy; as learning proceeds, the policies of the agents are expected to converge and the Q values become more accurate, which makes it reasonable to rely on the Q values from then on.

On this basis, we first design an S\(^2\)ES method with a Q value-based trigger condition, namely S\(^2\)ES-Q. Since an agent needs to know the Q values of its peers, the Q value is added to the augmented experience, so the augmented experience of S\(^2\)ES-Q can be written as \(\left\langle o^i, u^i, Q^i, y^i \right\rangle \) for agent i. Then, inspired by the Sigmoid function, a modifying function \(f(\cdot )\) is designed to modify the Q value, which is given by

$$\begin{aligned} f\left( Q^{ij},\tau \right) = \frac{Q^{ij} }{1+ e^{-a(\tau - b)}}, a,b > 0, \tau \in {\mathbb {N}}^+ , j \in {\mathcal {N}}_i, \end{aligned}$$
(7)

where \(Q^{ij}\) denotes the latest Q value that agent i has received from agent j, a and b are tuning parameters of the modifying function, \(\tau \) is the number of episodes that the agents have experienced, and \({\mathcal {N}}_i\) is the set of agent i's neighbors. Denoting by \(n_i\) the number of agent i's neighbors, the trigger condition for experience sharing can be written as

$$\begin{aligned} Q^i > \frac{1}{n_i}\sum _j f\left( Q^{ij},\tau \right) . \end{aligned}$$
(8)

Evidently, the modifying function in Eq. 7 lowers the Q value in the early stage so that condition 8 is satisfied more frequently, triggering more experience sharing early on. Later, \(f\left( Q^{ij},\tau \right) \) gradually converges to \(Q^{ij}\), which means that only experience with a higher state-action value than the average is shared. This significantly reduces communication in the later stage.
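
As a sketch, the trigger of Eqs. 7 and 8 can be written as below. The values a = 0.001 and b = 5000 are the settings reported in Sect. 4; the handling of an agent without neighbors is our own assumption.

```python
import math

def modified_q(q_ij, episode, a=0.001, b=5000):
    """Modifying function f(Q^{ij}, tau) of Eq. (7)."""
    return q_ij / (1.0 + math.exp(-a * (episode - b)))

def should_share_q(q_i, neighbor_qs, episode, a=0.001, b=5000):
    """Trigger condition of Eq. (8): share when Q^i exceeds the modified neighbor average."""
    if not neighbor_qs:          # no neighbors to compare against (assumption: do not share)
        return False
    avg = sum(modified_q(q, episode, a, b) for q in neighbor_qs) / len(neighbor_qs)
    return q_i > avg
```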

Shannon entropy is another metric that can describe action quality in terms of the confidence of the decision-making process. Thus, we adapt the normalized Shannon entropy of [28] to the S\(^2\)ES framework, yielding S\(^2\)ES-H. The normalized Shannon entropy can be written as

$$\begin{aligned} \mathrm {H}^i =-\sum _{m=1}^{M} \frac{p\left( u^i_{m}\right) \cdot \log p\left( u^i_{m}\right) }{\log M}, \end{aligned}$$
(9)

in which \(\mathrm {H}^i\) is the normalized entropy of agent i’s action distribution, M is the number of possible actions of agent i, i.e., the dimension of \({\mathcal {U}}_i\), and \(p\left( u^i_{m}\right) \) is defined as

$$\begin{aligned} p\left( u^i_{m}\right) = \frac{Q^i\left( o, u^i_{m}; \varvec{\theta }^i\right) }{\sum _{m'=1}^{M}{Q^i\left( o, u^i_{m'}; \varvec{\theta }^i\right) }}. \end{aligned}$$
(10)

The key design principle of the modifying function in S\(^2\)ES-Q is to encourage experience sharing in the early stage, because the Q value-based metric may be inaccurate and more exploration should be conducted at that time. Similarly, the entropy of the actions can also be inaccurate in the early learning stage, since \(\mathrm {H}^i\) is likewise calculated from Q values, as shown in Eq. 10. Thus, this principle should also be followed in S\(^2\)ES-H. For simplicity, we use the same modifying function \(f(\cdot )\) as in S\(^2\)ES-Q, and the trigger condition of S\(^2\)ES-H is given by

$$\begin{aligned} f\left( \mathrm {H}^i,\tau \right) = \frac{\mathrm {H}^i }{1+ e^{-a(\tau - b)}} < \mathrm {H}_T, \quad a,b > 0, \tau \in {\mathbb {N}}^+, \end{aligned}$$
(11)

where \(\mathrm {H}_T\) is a fixed threshold.

With the trigger condition in Eq. 11, experience sharing is triggered more frequently in the early learning stage; as learning proceeds, \(f\left( \mathrm {H}^i,\tau \right) \) converges to \(\mathrm {H}^i\) gradually, and only experience generated by highly confident actions is shared. Since the value of the normalized Shannon entropy is strictly limited to \(\left( 0,1 \right] \), it is convenient and reasonable to use a fixed threshold to trigger the communication, avoiding the inclusion of trigger-condition-related information in the augmented experience, such as \(Q^i\) in S\(^2\)ES-Q. Therefore, the augmented experience in S\(^2\)ES-H remains \(\left\langle o^i, u^i, y^i \right\rangle \). This convenient property explains why we choose the normalized entropy rather than the vanilla Shannon entropy in this work.
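
A corresponding sketch for the S\(^2\)ES-H trigger (Eqs. 9–11) is given below; the threshold h_threshold is a hypothetical placeholder since no value of \(\mathrm {H}_T\) is fixed here, and, as Eq. 10 implies, the Q values are assumed to be positive so that they can be normalized into a probability distribution.

```python
import math

def normalized_entropy(q_values):
    """Normalized Shannon entropy of Eq. (9), with p(u_m) defined as in Eq. (10)."""
    total = sum(q_values)                       # assumes positive Q values
    probs = [q / total for q in q_values]
    return -sum(p * math.log(p) for p in probs) / math.log(len(q_values))

def should_share_h(q_values, episode, h_threshold=0.5, a=0.001, b=5000):
    """Trigger condition of Eq. (11): share when the modified entropy falls below H_T."""
    h = normalized_entropy(q_values)
    return h / (1.0 + math.exp(-a * (episode - b))) < h_threshold
```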

However, we have to point out that the trigger conditions in Eqs. 8 and 11 are just intuitive examples to validate the basic ideas of this paper. A more systematic derivation of the trigger condition should be investigated in the future.

Fig. 2 Workflow of S\(^2\)ES

We can now give the overall workflow of S\(^2\)ES, as shown in Fig. 2. At time t, each agent in the system perceives the state of the environment \(s^t\) and generates the corresponding observation, e.g., \(o^t_1\) for agent 1. Then, the action \(u^t_1\) is generated according to the policy of agent 1. At the same time, \(o^t_1\) and \(u^t_1\) are saved to the augmented experience; the corresponding \(Q^t_1\) is also saved if S\(^2\)ES-Q is used rather than S\(^2\)ES-H. After getting feedback from the environment, the agent further calculates the target value \(y^t_1\) and saves it to finally form the augmented experience \(E^t_1\). If \(E^t_1\) is, in the view of agent 1 itself, qualified for sharing, it is transmitted to the peer agents. Each agent maintains a replay buffer that only stores the augmented experience of the current time step, which reduces nonstationarity. Algorithm 1 provides the pseudocode of the proposed S\(^2\)ES.

Algorithm 1 Pseudocode of S\(^2\)ES
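
Reading Fig. 2 step by step, one synchronized time step of S\(^2\)ES can be sketched as follows. This is our own illustrative rendering rather than the authors' implementation: q_nets, observe, env_step, update, and should_share are assumed callables with the semantics given in the docstring, and the S\(^2\)ES-Q variant would additionally store \(Q^i\) in the shared tuple.

```python
import random

def s2es_step(q_nets, observe, env_step, update, buffers, should_share,
              episode, epsilon=0.05, gamma=0.95):
    """One synchronized time step of S^2ES for all agents (illustrative sketch).

    q_nets[i](o) returns a list of agent i's Q values for observation o,
    observe(i) and env_step(i, u) wrap the environment, update(i, batch)
    minimizes Eq. (5) over the batch, and should_share(q, episode) is a
    trigger condition such as Eq. (8) or Eq. (11).
    """
    for i in buffers:                              # 1) drop the previous step's experience
        buffers[i].begin_step()
    for i in q_nets:                               # 2) act, build and (maybe) share experience
        o = observe(i)
        q = q_nets[i](o)
        u = random.randrange(len(q)) if random.random() < epsilon else q.index(max(q))
        r, o_next = env_step(i, u)                 # feedback from the environment
        y = r + gamma * max(q_nets[i](o_next))     # target value, computed once by the sender
        exp = (o, u, y)                            # augmented experience E^t_i = <o, u, y>
        buffers[i].add(exp)
        if should_share(q, episode):               # event-triggered, one-way active sharing
            for j in buffers:
                if j != i:
                    buffers[j].add(exp)
    for i in q_nets:                               # 3) synchronized update on the whole buffer
        update(i, buffers[i].batch())
```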

Analysis of scalability

In deep RL, the main computational load lies in the decision-making (forward) and learning (backpropagation) processes, both of which are related to the size of the network. In the following analysis, all the agents are assumed to use the same neural networks.

Assuming S\(^2\)ES is utilized in an MAS with n learning agents, each agent conducts one round of both decision-making and training at each time step. For a single agent, denoting the computational load of one decision-making process as \(c_{\text {f}}\) and one backpropagation as \(c_{\text {b}}\), the total computational load of S\(^2\)ES in a step can be written as

$$\begin{aligned} T_{S^2{\text {ES}}} = n \cdot c_{\text {f}} + n \cdot c_{\text {b}} = {\mathcal {C}} \cdot n, \end{aligned}$$
(12)

where \({\mathcal {C}}\) is a constant. We can see that the adoption of S\(^2\)ES does not affect the time complexity of the system, which remains O(n).

As for the action advising-based approaches, assume that each agent raises an inquiry with probability \(p_{\text {ask}}\) and that each agent receiving an inquiry provides advice with probability \(p_{\text {ans}}\). For each step in an MAS with n learning agents, the total computational load can then be given as

$$\begin{aligned} \begin{aligned} T_{\text {AA}}&= n \cdot c_{\text {f}} + n \cdot c_{\text {b}} + p_{\text {ask}}\cdot n \cdot p_{\text {ans}} \cdot (n-1) \cdot c_{\text {f}} \\&={\mathcal {C}}_1 \cdot n^2 + {\mathcal {C}}_2 \cdot n, \end{aligned} \end{aligned}$$
(13)

where \({\mathcal {C}}_1\) and \({\mathcal {C}}_2\) are constants. The third term of Eq. 13 depicts the extra computation induced by action advising. Therefore, the time complexity of action advising-based methods is \(O(n^2)\).

To summarize, the computational load of MARL with S\(^2\)ES increases linearly with the scale of the MAS, indicating that S\(^2\)ES does not degrade the scalability of the learning system, whereas the computational load of the action advising-based approaches increases quadratically under the same settings. This indicates the better scalability of S\(^2\)ES.
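
The gap between the two cost models can be made tangible with a toy calculation; the constants c_f = c_b = 1 and p_ask = p_ans = 0.5 below are purely illustrative and not taken from the paper.

```python
c_f = c_b = 1.0          # illustrative per-forward and per-backward costs
p_ask = p_ans = 0.5      # illustrative asking and answering probabilities

for n in (2, 5, 10, 20, 50):
    t_s2es = n * c_f + n * c_b                                     # Eq. (12): O(n)
    t_aa = n * c_f + n * c_b + p_ask * n * p_ans * (n - 1) * c_f   # Eq. (13): O(n^2)
    print(f"n = {n:2d}   S2ES: {t_s2es:6.1f}   action advising: {t_aa:7.1f}")
```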

Empirical study

In this section, we first compare the proposed S\(^2\)ES methods (S\(^2\)ES-Q and S\(^2\)ES-H) with other methods empirically, analyzing the differences in mission performance and computational load. Then, ablation studies are provided to demonstrate the impact of each component. Finally, the scalability of S\(^2\)ES is verified.

To validate the effectiveness and efficiency of S\(^2\)ES, we use the classic minefield navigation task (MNT) platform [38, 48]. The MNT simulates a grid minefield environment with one flag (target) and several tanks (agents) and bombs (obstacles) randomly distributed in the field. The mission of the tanks is to navigate to the target while avoiding collisions. Once a tank arrives at the target within the time limit, it is counted as one successful event; if a tank collides, it fails the mission and stays stuck in place until the next episode. An episode ends when the outcome of every agent has been determined, i.e., success or failure, or when the time runs out. Then, the positions of the agents, obstacles, and destination are randomly reset to start a new episode. A typical MNT scenario is given in Fig. 3.

Fig. 3 Illustration of an MNT mission

In an MNT mission, global information about the environment is not available to the agents. Instead, the agents are equipped with three sets of detecting sonars that observe local information. The bomb and agent sonar sets detect bombs and agents in five directions, namely left (L), left front (LF), front (F), right front (RF), and right (R), obtaining the relative distance and bearing of the bombs and peer agents. As for the target, only the relative bearing can be obtained, but the target sonar set covers all 8 directions, i.e., L, LF, F, RF, R, left back, back, and right back, which means that the bearing of the target is available in any direction. This matches the real-world sensing of not only autonomous tanks or unmanned ground vehicles (UGVs) [6, 18], but also other unmanned agents such as autonomous underwater vehicles (AUVs) [4, 5, 46]. The agents can only move one grid at a time and are not allowed to move backwards, so the action space of agent i can be formalized as \({\mathcal {U}}_i = \text {\{L, LF, F, RF, R\}}\).

All the agents use fully connected multilayer perceptrons with one hidden layer of 36 neurons and run DQN independently with a learning rate of 0.5. The \(\varepsilon \)-greedy strategy is adopted for exploration, with \(\varepsilon \) initialized to 0.5 and annealed linearly to 0.005. The hyperparameters of the modifying function are set to \(a=0.001\) and \(b=5000\).

Fig. 4 Policy enhancement in the learning process. The subfigures are snapshots of 5 S\(^2\)ES-Q agents in MNT missions after 0, 5000, 10,000, and 30,000 episodes, respectively. The circles denote the starting positions of the agents and the lines with arrows track their trajectories

Comparison with existing algorithms

In this section, we compare the performance of the proposed S\(^2\)ES with the action advising-based method eTL, episode sharing [39], and deep IQL with no knowledge transfer.

AdHocTD, AdHocVisit [32], eTL [14], and LeCTR [25] are among the most classic action advising-based algorithms that investigate the knowledge transfer problem for MARL without experts; AdHocTD and AdHocVisit are based on state visit counts, while eTL is based on Q values. However, for an MNT task, the state space is very large. The number of states for each agent in MNT is given by

$$\begin{aligned} {n_{\text {Grid}}}^{n_{\text {Sonar}}} \cdot n_{\text {TargetBearing}}, \end{aligned}$$
(14)

where \(n_{\text {Grid}}\) is the number of grids on each side (i.e., the resolution of the sonars), \(n_{\text {Sonar}}\) denotes the number of sonars, and \(n_{\text {TargetBearing}}\) is the number of possible relative directions of the destination. For a \(16 \times 16\) map, the number of states is \(16^{10} \times 8\), which makes counting the visits to each state infeasible. At the same time, LeCTR is designed only for pairwise scenarios and cannot yet be applied to an MAS with more than two agents. Thus, we select eTL as the representative of the action advising-based methods.
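
The size of this state space can be checked directly from Eq. 14 with the map parameters quoted above (a quick arithmetic verification, not part of the original experiments).

```python
n_grid = 16            # side length of the map, i.e., the resolution of the sonars
n_sonar = 10           # 5 bomb-detecting sonars plus 5 agent-detecting sonars
n_target_bearing = 8   # possible relative directions of the destination

n_states = n_grid ** n_sonar * n_target_bearing    # Eq. (14)
print(f"{n_states:,}")                             # 8,796,093,022,208 ~ 8.8e12 states per agent
```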

Table 1 Time consumption of the compared algorithms (in hours)
Fig. 5 Success rates of the compared algorithms

In the following experiments, we set the minefield as a \(16 \times 16\) grid world, and the length of each episode is limited to 30 steps. There are 30,000 episodes in each experiment, and the performance is evaluated over intervals of 100 episodes. Experiments are conducted in four MNT scenarios of different complexity, and the results are averaged over 50 independent runs.

Figure 4 shows the ability enhancement of the S\(^2\)ES-Q learning agents during a learning process. Figure 4a tracks the trajectories of the agents without learning, in which the agents move blindly and end up with zero reward (the blue and cyan trajectories lead the tanks to a bomb, while the others make the tanks wander in the field until the 30 steps are used up). As the learning process proceeds, the agents gradually learn to approach the target while avoiding bombs. As shown in Fig. 4b–d, there is a clear tendency for the agents to become more and more likely to find collision-free short paths to the destination.

Figure 5 demonstrates the performance of the methods in MNT scenarios with 3 agents and 5 bombs, 5 agents and 10 bombs, 10 agents and 5 bombs, and 15 agents and 3 bombs, respectively. The lines are plotted according to the mean success rate over the independent runs, and the shaded regions denote the standard deviation (STDV). In all of the provided scenarios, the success rates of both S\(^2\)ES-Q and S\(^2\)ES-H are higher than those of the other approaches in both the early stage and the subsequent convergence phase, indicating that S\(^2\)ES accelerates learning better than the other knowledge transfer approaches. At the same time, S\(^2\)ES-Q performs similarly to S\(^2\)ES-H, showcasing the rationality of the design principle of the trigger condition. In addition, the STDV of the S\(^2\)ES methods is also lower than that of the other methods.

Fig. 6 Number of communications of the compared algorithms

Table 2 The number of transmitted packets of the compared algorithms

Table 1 details the total computation time of the 50 independent runs. The data were generated on the same computer with an Intel\(^\circledR \) Core(TM) i7-8700K CPU @ 3.70 GHz and 32 GB of RAM. From the table, we can see that S\(^2\)ES (including S\(^2\)ES-Q and S\(^2\)ES-H) achieves the best performance with the lowest computation time among the compared algorithms. Interaction between agents inevitably increases the computation time. However, since S\(^2\)ES brings little extra computational load and its significant performance enhancement reduces the average length of the episodes, its computing time is even shorter than that of the algorithm with no communication (IQL). When the action advising-based method eTL is used, the advisors have to conduct extra calculations to generate advice for the advisees, whereas with S\(^2\)ES, the potential advisors share experience based solely on their own observations rather than considering the others’ status; this causes the difference in computation time between S\(^2\)ES and eTL. The long computing time of Episode Sharing stems from learning from too many episodes and from its low success rate.

The number of communications during the whole learning process is illustrated in Fig. 6. S\(^2\)ES-Q, S\(^2\)ES-H, and eTL show downward trends during learning, owing to the decrease in the number of steps per episode brought by the policy improvement. The trends of the S\(^2\)ES algorithms are more pronounced thanks to the trigger conditions in Eqs. 8 and 11. Note that although the episode sharing method shows a small number of communications in the early stage, it transmits all the transitions of an episode at once. Defining the size of a state transition as the standard size of a packet, Table 2 provides the average number of transmitted packets per episode. It is clear that both S\(^2\)ES-Q and S\(^2\)ES-H transmit far fewer packets than the other knowledge transfer approaches, which indicates that S\(^2\)ES has the lowest communication traffic.

To summarize, the proposed S\(^2\)ES-Q and S\(^2\)ES-H efficiently accelerate the learning process in the learning-from-scratch scenario and achieve better performance than the other popular knowledge transfer approaches, with a lower computational load and communication cost than the action advising-based and episode sharing-based methods.

Ablations

To confirm whether each part of S\(^2\)ES works as expected, we further provide a series of ablation studies in the MNT scenario with 5 agents and 10 bombs.

Figure 7 demonstrates the effectiveness of sharing the augmented experience. Compared to deep IQL without any information exchange among agents, the learning speed is significantly enhanced in the early learning stage by sharing the regular experience and using it in a conventional experience replay manner (ES-EXP-ER). However, the incorporation of the replay buffer brings nonstationarity, which causes the significant performance drop. When the agents share the augmented experience but still save and sample it through experience replay (ES-AEXP-ER), the learning speed in the early stage is further enhanced. Moreover, the learning curve of ES-AEXP-ER drops less dramatically than that of ES-EXP-ER, indicating that the stability is improved. This matches the motivation behind the design of the augmented experience, namely that the shared target value breaks the correlation between the Q value and the target value. Since the nonstationarity problem remains unsolved, the learning target keeps changing and the performance still drops as learning proceeds.

Fig. 7 Ablation for augmented experience

The performance of the synchronized learning scheme is shown in Fig. 8. By incorporating synchronized learning, even sharing the regular experience (ES-EXP-SYNC) achieves better performance than ES-EXP-ER. More importantly, ES-EXP-SYNC converges to a similar optimum to eTL without the performance drop. Since eTL does not use experience replay and thus does not suffer from this source of nonstationarity, it is fair to say that the incorporation of synchronized learning reduces the nonstationarity as expected.

Fig. 8 Ablation for synchronized learning

The influence of the event triggering scheme is demonstrated in Fig. 9. Here, we add S\(^2\)ES without the event trigger (i.e., sharing all the augmented experience, namely ES-AEXP-SYNC-ALL) for comparison. A comparison of the number of communications is provided in Fig. 10. With the event triggering schemes, communication is significantly reduced, especially in the late learning stage, while the performance remains satisfactory. This is a desirable feature for systems that are sensitive to the communication cost of the learning process. The trade-off between communication and learning performance can be adjusted by tuning a and b in Eq. 7 according to the user's preference.

Fig. 9 Ablation for event trigger

Fig. 10 Comparison of communications with and without the event trigger

Validation of scalability enhancement

In this section, the scalability of the proposed S\(^2\)ES is investigated. Since S\(^2\)ES-H performs similarly to S\(^2\)ES-Q and the choice of trigger condition does not affect the scalability of the algorithm, we take S\(^2\)ES-Q as the representative of S\(^2\)ES for clarity and simply refer to it as S\(^2\)ES in the following figures and analysis.

The test environment is an extension of MNT with a \(25 \times 25\) grid world and 10 mines, in which 5, 10, 15, 20, 25, and 30 learning agents are deployed, respectively. Each episode has a timeout of 50 steps, and the results are evaluated every 100 episodes. There are 15,000 episodes in each simulation run, and 50 independent runs are carried out.

Fig. 11 Mean success rates for different numbers of agents

Remaining effective is a primary requirement when analyzing scalability [17]. Figure 11 shows the mean success rates over each episode in the 50 runs. The boxes and dots in the figure show that S\(^2\)ES outperforms eTL in terms of both the overall success rate and the variance, regardless of the number of agents.

Fig. 12 Computational complexity with different numbers of agents

Computational load is a crucial feature when evaluating the scalability of an algorithm. If the computational load increases dramatically with the system scale, it becomes impractical to apply the learning method to large-scale multiagent systems. The time consumption of eTL and S\(^2\)ES with different numbers of agents is given in Fig. 12, where the time refers to the total time of the 50 independent runs. From this figure, we can see that the computational load of S\(^2\)ES increases linearly as the number of agents increases, while that of eTL grows quadratically. This result matches our analysis in Sect. 3.4 and can be attributed to the different knowledge transfer mechanisms. Agents running S\(^2\)ES provide “good” knowledge actively and thus incur no extra computational load to generate advice, whereas agents with action advising-based knowledge transfer have to make decisions for both themselves and their peers.

Fig. 13 Number of communications with different numbers of agents

Communication is another key factor. Figure 13 compares the average number of communications of eTL and S\(^2\)ES in each episode, from which we can see that the communication of the two approaches differs by orders of magnitude. Meanwhile, it is notable that the communication of eTL also increases quadratically, while that of S\(^2\)ES increases only linearly.

To summarize, the simulation results on the MNT platform indicate that S\(^2\)ES outperforms the action advising-based methods in terms of both performance and scalability. A general test demonstrates the effectiveness and efficiency of S\(^2\)ES, and the ablations further clarify the contribution of each part of S\(^2\)ES. Moreover, the scalability simulations show that, compared to the action advising-based methods, S\(^2\)ES scales better with respect to both computational load and communication cost.

Conclusion

Aiming to accelerate MARL in scenarios where all the agents learn from scratch, we propose a stationary and scalable knowledge transfer approach based on experience sharing, namely S\(^2\)ES, to conduct knowledge transfer among agents during learning.

The main structure of S\(^2\)ES is divided into what kind of experience, how to learn, and when to transfer. Specifically, we first design an augmented form of experience, which effectively enhances the learning speed. By actively sharing the augmented experience with the peers, knowledge can be transferred without extra computation for the peers, which brings high computational efficiency and scalability. A synchronized learning scheme is then introduced, by which the agents only learn from the experience of the current time step. This avoids the nonstationarity brought by conventional experience replay and, at the same time, retains data efficiency to some extent. Finally, considering that some MARL scenarios are sensitive to the communication cost of the learning process, we further design an event triggering scheme to determine when to share the augmented experience with the peers. Taking the accuracy of the Q value into account, two trigger conditions are provided, one Q value-based and the other normalized entropy-based.

Empirical studies in MNT scenarios of different complexity demonstrate that S\(^2\)ES achieves effective acceleration for learning-from-scratch MARL and outperforms the other popular approaches. Ablation studies further confirm the contribution of each part to the performance enhancement, which matches our expectations well. We also analyze the performance when the learning system scales up, which not only shows that S\(^2\)ES keeps performing well in different settings, but also confirms that its computational load increases only linearly with the number of agents, indicating better scalability than the action advising-based approaches.

In the future, the problem of when to transfer should be further investigated. More sophisticated trigger conditions should be developed, which may jointly consider communication cost, state familiarity, negative transfer, and the accuracy of the state-action value. Moreover, for a better application of S\(^2\)ES, experience sharing schemes for more complex tasks, such as StarCraft [29] and PO-MNT [15], should also be studied.