Abstract
Communication is critical for a large multi-agent system to stay organized and productive. Recently, Deep Reinforcement Learning (DRL) has been adopted to learn communication among multiple intelligent agents. However, in the DRL setting, a growing number of communication messages introduces two problems: (1) some messages are usually redundant; (2) even when all messages are necessary, processing a large number of messages efficiently remains a major challenge. In this paper, we propose a DRL method named Double Attentional Actor-Critic Message Processor (DAACMP) to jointly address these two problems. Specifically, DAACMP adopts two attention mechanisms. The first is embedded in the actor part, so that it can adaptively select the important messages from all communication messages. The second is embedded in the critic part, so that all important messages can be processed efficiently. We evaluate DAACMP on three multi-agent tasks with seven different settings. Results show that DAACMP not only outperforms several state-of-the-art methods but also achieves better scalability in all tasks. Furthermore, we conduct experiments to reveal some insights about the proposed attention mechanisms and the learned policies.
Notes
It is a modification of our ACML [17] accepted by AAAI-2020.
It is the same as that of our ATT-MADDPG [18] accepted by AAMAS-2019.
Formally, \(P(s', r_i|\mathbf {o},\mathbf {a},{\varvec{\pi }}) = P(s', r_i|s,a_1,\ldots ,a_N,\pi _1,\ldots ,\pi _N) = P(s', r_i|s,a_1,\ldots ,a_N) = P(s', r_i|s,a_1,\ldots ,a_N,\pi '_1,\ldots ,\pi '_N)\) for any \(\pi _i \ne \pi '_i\). Please refer to MADDPG [33] for details.
The detailed derivation can be found in [56].
The expectation is equivalent to the weighted summation, and the weight of \(Q_{i}^{\pi _i}(s,a_i,\mathbf {a}_{-i})\) is \({\varvec{\pi }}_{-i}(\mathbf {a}_{-i}|s)\) as shown in Eq. (10).
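A minimal numerical sketch of this equivalence, using made-up Q-values and joint-policy probabilities (the four joint actions and their values below are purely illustrative):

```python
import numpy as np

# Hypothetical setup: agent i's Q-value depends on the joint action a_{-i}
# of the other agents; here there are 4 possible joint actions.
q_values = np.array([1.0, 2.0, 3.0, 4.0])    # Q_i(s, a_i, a_{-i}) for each a_{-i}
pi_minus_i = np.array([0.1, 0.2, 0.3, 0.4])  # joint policy probs pi_{-i}(a_{-i}|s)

# The expectation over a_{-i} ...
expected_q = sum(p * q for p, q in zip(pi_minus_i, q_values))
# ... equals the summation weighted by pi_{-i}(a_{-i}|s).
weighted_sum = np.dot(pi_minus_i, q_values)

print(expected_q, weighted_sum)  # both 3.0
```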
This is why we use \(Q_{i}^{k}(s,a_i|\mathbf {a}_{-i};w_i)\) instead of \(Q_{i}^{k}(s,a_i,\mathbf {a}_{-i};w_i)\) to represent the defined action conditional Q-value.
Please note that \(M_i\) is a weighted summation of all other local messages \(m_{j \wedge j \ne i}\), while \(m_j\) is an encoding of \(o_j\). Therefore, \([m_i|M_i]\) contains all the necessary information in \(\langle o_{i}, \mathbf {o}_{-i} \rangle\), which means that the shared representation learning will not lose important information about \(\langle o_{i}, \mathbf {o}_{-i} \rangle\) if the model is well-trained. Moreover, it brings many benefits, such as data efficiency and more robust training.
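A small sketch of this aggregation, assuming three agents with 4-D message encodings; the dot-product attention scores below are a stand-in for whatever learned attention the model uses:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical shapes: N agents, each local message m_j encodes o_j as a 4-D vector.
rng = np.random.default_rng(0)
N, d = 3, 4
m = rng.normal(size=(N, d))  # m_j = encode(o_j), stacked row-wise

i = 0                                             # agent of interest
others = [j for j in range(N) if j != i]
scores = np.array([m[i] @ m[j] for j in others])  # illustrative dot-product scores
alpha = softmax(scores)                           # attention weights, sum to 1

# M_i: weighted summation of all other local messages m_{j != i}
M_i = sum(a * m[j] for a, j in zip(alpha, others))

# [m_i | M_i]: the concatenated representation used for shared learning
rep = np.concatenate([m[i], M_i])
assert rep.shape == (2 * d,)
```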
The detailed advantages of minimizing MLU are discussed in [58].
There are two exceptions. The first is that ACMP-AA underperforms ACMP on the cooperative navigation task when \(N=2\). The other is that ACMP-AA underperforms ACMP on traffic control tasks. As analyzed before, the reason for the former exception is that this setting is too simple to leave room for advanced methods to improve on, while the reason for the latter is that the traffic control task has random biases that go against the property of ACMP-AA.
Recall that the 2D plane is bounded, and positions wrap around at the boundary. The agent’s next position is calculated by \(p_{t+1} = \langle (p_x+v_x)\%10, (p_y+v_y)\%10 \rangle\), where \(\%\) denotes the modulo operation.
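This update can be sketched as follows; the function and argument names are illustrative, not from the paper's code, and the wrap-around relies on Python's modulo returning a non-negative result for a positive modulus:

```python
# Position update on a bounded 10x10 plane with wrap-around (torus) boundaries.
def next_position(pos, vel, bound=10):
    px, py = pos
    vx, vy = vel
    # Python's % keeps results in [0, bound) even for negative coordinates.
    return ((px + vx) % bound, (py + vy) % bound)

# An agent at (9, 5) moving with velocity (2, -6) wraps around both edges:
print(next_position((9, 5), (2, -6)))  # (1, 9)
```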
References
Sutton, R. S., & Barto, A. G. (1998). Introduction to reinforcement learning (Vol. 135). Cambridge: MIT Press.
Tan, M. (1993). Multi-agent reinforcement learning: Independent versus cooperative agents. In Proceedings of the tenth international conference on machine learning (pp. 330–337).
Wu, F., Zilberstein, S., & Chen, X. (2011). Online planning for multi-agent systems with bounded communication. Artificial Intelligence, 175(2), 487–511.
Zhang, C., & Lesser, V. (2013). Coordinating multi-agent reinforcement learning with limited communication. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, international foundation for autonomous agents and multiagent systems (pp. 1101–1108).
Roth, M., Simmons, R., & Veloso, M. (2005). Reasoning about joint beliefs for execution-time communication decisions. In Proceedings of the fourth international joint conference on autonomous agents and multiagent systems, ACM (pp. 786–793).
Roth, M., Simmons, R., & Veloso, M. (2006). What to communicate? Execution-time decision in multi-agent pomdps. In Distributed autonomous robotic systems (Vol. 7, pp. 177–186). Berlin: Springer.
Sukhbaatar, S., Fergus, R., et al. (2016). Learning multiagent communication with backpropagation. In Advances in neural information processing systems (pp. 2244–2252).
Foerster, J., Assael, Y. M., de Freitas, N., & Whiteson, S. (2016). Learning to communicate with deep multi-agent reinforcement learning. In Advances in neural information processing systems (pp 2137–2145).
Peng, P., Yuan, Q., Wen, Y., Yang, Y., Tang, Z., Long, H., & Wang, J. (2017). Multiagent bidirectionally-coordinated nets for learning to play starcraft combat games. arXiv preprint arXiv:1703.10069.
Mao, H., Gong, Z., Ni, Y., & Xiao, Z. (2017). Accnet: Actor-coordinator-critic net for “learning-to-communicate” with deep multi-agent reinforcement learning. arXiv preprint arXiv:1706.03235.
Kong, X., Xin, B., Liu, F., & Wang, Y. (2017). Revisiting the master-slave architecture in multi-agent deep reinforcement learning. arXiv preprint arXiv:1712.07305.
Kilinc, O., & Montana, G. (2019). Multi-agent deep reinforcement learning with extremely noisy observations. In International conference on learning representations.
Kim, D., Moon, S., Hostallero, D., Kang, W. J., Lee, T., Son, K., & Yi, Y. (2019). Learning to schedule communication in multi-agent reinforcement learning. In International conference on learning representations. https://openreview.net/forum?id=SJxu5iR9KQ.
Singh, A., Jain, T., & Sukhbaatar, S. (2019). Individualized controlled continuous communication model for multiagent cooperative and competitive tasks. In International conference on learning representations. https://openreview.net/forum?id=rye7knCqK7.
Kim, W., Cho, M., & Sung, Y. (2019). Message-dropout: An efficient training method for multi-agent deep reinforcement learning. arXiv preprint arXiv:1902.06527.
Mao, H., Gong, Z., Zhang, Z., Xiao, Z., & Ni, Y. (2019). Learning multi-agent communication under limited-bandwidth restriction for internet packet routing. arXiv preprint arXiv:1903.05561.
Mao, H., Zhang, Z., Xiao, Z., Gong, Z., & Ni, Y. (2020). Learning agent communication under limited bandwidth by message pruning. In AAAI 2020.
Mao, H., Zhang, Z., Xiao, Z., & Gong, Z. (2019). Modelling the dynamic joint policy of teammates with attention multi-agent DDPG. In Proceedings of the 18th international joint conference on autonomous agents and multiagent systems, ACM.
Bernstein, D. S., Givan, R., Immerman, N., & Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4), 819–840.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
Konda, V. R., & Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in neural information processing systems (pp. 1008–1014).
Konda, V. R., & Tsitsiklis, J. N. (2003). On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4), 1143–1166.
Grondman, I., Busoniu, L., Lopes, G. A., & Babuska, R. (2012). A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(6), 1291–1307.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic policy gradient algorithms. In ICML.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Mnih, V., Heess, N., Graves, A., et al. (2014). Recurrent models of visual attention. In Advances in neural information processing systems (pp. 2204–2212).
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (pp. 2048–2057).
Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).
Pynadath, D. V., & Tambe, M. (2002). The communicative multiagent team decision problem: Analyzing teamwork theories and models. Journal of Artificial Intelligence Research, 16, 389–423.
Goldman, C. V., & Zilberstein, S. (2004). Decentralized control of cooperative systems: Categorization and complexity analysis. Journal of Artificial Intelligence Research, 22, 143–174.
Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O. P., & Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in neural information processing systems (pp. 6379–6390).
Chu, X., & Ye, H. (2017). Parameter sharing deep deterministic policy gradient for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:1710.00336.
Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., & Whiteson, S. (2017). Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926.
Peng, Z., Zhang, L., & Luo, T. (2018). Learning to communicate via supervised attentional message processing. In Proceedings of the 31st international conference on computer animation and social agents, ACM (pp. 11–16).
Jiang, J., & Lu, Z. (2018). Learning attentional communication for multi-agent cooperation. arXiv preprint arXiv:1805.07733.
Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., & Wang, J. (2018). Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438.
Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., et al. (2017). Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296.
Rashid, T., Samvelyan, M., de Witt, C. S., Farquhar, G., Foerster, J., & Whiteson, S. (2018). Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485.
Lazaridou, A., Peysakhovich, A., & Baroni, M. (2016). Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182.
Mordatch, I., & Abbeel, P. (2017). Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908.
Das, A., Kottur, S., Moura, J. M., Lee, S., & Batra, D. (2017). Learning cooperative visual dialog agents with deep reinforcement learning. arXiv preprint arXiv:1703.06585.
Havrylov, S., & Titov, I. (2017). Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. arXiv preprint arXiv:1705.11192.
Hernandez-Leal, P., Kaisers, M., Baarslag, T., & de Cote, E. M. (2017). A survey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183.
Sorokin, I., Seleznev, A., Pavlov, M., Fedorov, A., & Ignateva, A. (2015). Deep attention recurrent q-network. arXiv preprint arXiv:1512.01693.
Oh, J., Chockalingam, V., Singh, S., & Lee, H. (2016). Control of memory, active perception, and action in minecraft. In Proceedings of the 33rd international conference on machine learning, PMLR (pp. 2790–2799).
Omidshafiei, S., Kim, D. K., Pazis, J., & How, J. P. (2017). Crossmodal attentive skill learner. arXiv preprint arXiv:1711.10314.
Choi, J., Lee, B. J., & Zhang, B. T. (2017). Multi-focus attention network for efficient deep reinforcement learning. In Workshops at the thirty-first AAAI conference on artificial intelligence.
Geng, M., Xu, K., Zhou, X., Ding, B., Wang, H., & Zhang, L. (2019). Learning to cooperate via an attention-based communication neural network in decentralized multi-robot exploration. Entropy, 21(3), 294.
Iqbal, S., & Sha, F. (2018). Actor-attention-critic for multi-agent reinforcement learning. arXiv preprint arXiv:1810.02912.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
Schaul, T., Horgan, D., Gregor, K., & Silver, D. (2015). Universal value function approximators. In International conference on machine learning (pp. 1312–1320).
Albrecht, S. V., & Stone, P. (2018). Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258, 66–95.
He, H., Boyd-Graber, J., Kwok, K., & Daumé III, H. (2016). Opponent modeling in deep reinforcement learning. In International conference on machine learning (pp. 1804–1813).
Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., & De Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd international conference on machine learning, ICML 2016 (pp. 1995–2003).
Kandula, S., Katabi, D., Davie, B., & Charny, A. (2005). Walking the tightrope: Responsive yet stable traffic engineering. ACM SIGCOMM Computer Communication Review, 35, 253–264.
Mataric, M. J. (1994). Reward functions for accelerated learning. In Machine learning proceedings 1994 (pp. 181–189). New York: Elsevier.
Ha, D., & Schmidhuber, J. (2018). World models. arXiv preprint arXiv:1803.10122.
Chockalingam, V., Sung, T. T. K., Behbahani, F., Gargeya, R., Sivanantham, A., & Malysheva, A. (2018). Extending world models for multi-agent reinforcement learning in malmö. In Joint Proceedings of the AIIDE 2018 Workshops co-located with 14th AAAI conference on artificial intelligence and interactive digital entertainment (AIIDE 2018). http://ceur-ws.org/Vol-2282/MARLO_110.pdf.
Andreas, J., Dragan, A., & Klein, D. (2017). Translating neuralese. arXiv preprint arXiv:1704.06960.
Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
Lee, J. B., Rossi, R. A., Kim, S., Ahmed, N. K., & Koh, E. (2018). Attention models in graphs: A survey. arXiv preprint arXiv:1807.07984.
Wang, T., Liao, R., Ba, J., & Fidler, S. (2018). Nervenet: Learning structured policy with graph neural networks. In International conference on learning representations. https://openreview.net/forum?id=S1sqHMZCb.
Jiang, J., Dun, C., & Lu, Z. (2018). Graph convolutional reinforcement learning for multi-agent cooperation. arXiv preprint arXiv:1810.09202.
Acknowledgements
The authors would like to thank the anonymous reviewers for their comments. This work was supported by the National Natural Science Foundation of China under Grant No. 61872397.
Appendices
Appendix 1: The hyperparameters
Appendix 2: The layer merging method
This section introduces the layer-merging method that transforms the multi-dimensional action conditional Q-values into scalar Q-values.
Originally, we want to implement the following equation (here, we take \(K=3\) as an example):

$$\begin{aligned} Q_i = \sum _{k=1}^{3} w_k Q_{ik} \end{aligned}$$ (13)

where \(Q_i\) is the real Q-value used in Bellman equations, \(Q_{ik}\) is the k-th Q-value head, and \(w_k\) is the weight of each Q-value head. Note that in this equation, \(Q_{ik}\) is a scalar. This is the original naive idea.
However, as mentioned in the main paper, in our real implementation, the action conditional Q-values \(Q^{k}_{i}\) (i.e., the Q-value heads) and the contextual Q-value \(Q^{c}_{i}\) are 32D vectors that mimic the scalar Q-value. To generate the real Q-value \(Q_{i}\) used in Bellman equations, we further add a fully-connected layer, which has one output node representing the real Q-value \(Q_{i}\), after the contextual Q-value \(Q^{c}_{i}\).
With the above precondition, we use the variables in the main paper (i.e., in our real implementation) to calculate \(\varvec{w}_{\varvec{k}}\) and \(\varvec{Q}_{\varvec{ik}}\) (and accordingly, \(\varvec{Q}_{\varvec{i}}\)) in Eq. (13) in a suitable way.
Recall that, in the real implementation, \(Q_i\) is generated using:

$$\begin{aligned} Q_i = \sum _{m=1}^{32} l^m {Q^{c}_{i}}^{m} \end{aligned}$$ (14)

where \(Q^{c}_{i}\) is the 32D contextual Q-value, \({Q^{c}_{i}}^{m}\) is the m-th element of \(Q^{c}_{i}\), and \(l^m\) is the m-th network weight linking \({Q^{c}_{i}}^{m}\) and \(Q_i\). Note that \(l^m\) is a scalar, and the last layer of the critic network can be denoted as \(L=[l^1, l^2, \ldots , l^{32}]\).
Recall that, in the real implementation, the contextual Q-value \(Q^{c}_{i}\) is generated using:

$$\begin{aligned} Q^{c}_{i} = \sum _{k=1}^{3} W^k Q^{k}_{i} \end{aligned}$$

where \(Q^{k}_{i}\) is the action conditional Q-values (each of which is a 32D vector), and \(W=\langle W^1, W^2, W^3 \rangle\) is the learned attention weight where \(W^1 + W^2 + W^3 = 1\). Note that \(W^k\) is a scalar. Accordingly, the m-th element of \(Q^{c}_{i}\) is generated using:

$$\begin{aligned} {Q^{c}_{i}}^{m} = \sum _{k=1}^{3} W^k {Q^{k}_{i}}^{m} \end{aligned}$$
Then we can rewrite Eq. (14) as:

$$\begin{aligned} Q_i = \sum _{m=1}^{32} l^m \sum _{k=1}^{3} W^k {Q^{k}_{i}}^{m} = \sum _{k=1}^{3} W^k \left( \sum _{m=1}^{32} l^m {Q^{k}_{i}}^{m} \right) \end{aligned}$$ (18)

Comparing Eq. (18) with Eq. (13), we can calculate \(w_k\) and \(Q_{ik}\) using:

$$\begin{aligned} w_k = W^k, \qquad Q_{ik} = \sum _{m=1}^{32} l^m {Q^{k}_{i}}^{m} = L \cdot Q^{k}_{i} \end{aligned}$$
As can be seen from the above equations, this method directly connects the action conditional Q-values \(Q^k_i\) with the last layer \(L\) of the critic network to transform the multi-dimensional Q-values into scalar Q-values. It can therefore be seen as a layer merging method.
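The layer-merging equivalence can be checked numerically. The sketch below uses random stand-ins for the learned quantities (the head values, attention weights, and last-layer weights are all made up), and verifies that folding \(L\) into each head before the attention-weighted sum gives the same scalar Q-value as the original two-step computation:

```python
import numpy as np

rng = np.random.default_rng(1)
K, M = 3, 32                 # K Q-value heads, each a 32-D vector

Qk = rng.normal(size=(K, M)) # action conditional Q-values Q_i^k (illustrative)
W = np.array([0.2, 0.3, 0.5])# attention weights W^k, sum to 1 (illustrative)
L = rng.normal(size=M)       # last-layer weights [l^1, ..., l^32] (illustrative)

# Original two-step computation: contextual Q-value, then scalar Q-value.
Qc = W @ Qk                  # Q_i^c = sum_k W^k * Q_i^k  (32-D vector)
Q_i = L @ Qc                 # Q_i   = sum_m l^m * (Q_i^c)^m  (scalar)

# Layer merging: fold L into each head first, i.e. Q_ik = L . Q_i^k.
Qik = Qk @ L                 # K scalar head values
Q_i_merged = W @ Qik         # Q_i = sum_k w_k * Q_ik with w_k = W^k

assert np.isclose(Q_i, Q_i_merged)
```

The equivalence holds because the final fully-connected layer is linear, so it distributes over the attention-weighted sum of the heads.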
Cite this article
Mao, H., Zhang, Z., Xiao, Z. et al. Learning multi-agent communication with double attentional deep reinforcement learning. Auton Agent Multi-Agent Syst 34, 32 (2020). https://doi.org/10.1007/s10458-020-09455-w