Abstract
Communication is critical for a large multi-agent system to stay organized and productive. Recently, Deep Reinforcement Learning (DRL) has been adopted to learn communication among multiple intelligent agents. However, in the DRL setting, a growing number of communication messages introduces two problems: (1) some messages are usually redundant; (2) even when all messages are necessary, processing a large number of messages efficiently remains a major challenge. In this paper, we propose a DRL method named Double Attentional Actor-Critic Message Processor (DAACMP) to jointly address these two problems. Specifically, DAACMP adopts two attention mechanisms. The first is embedded in the actor part, so that it can adaptively select the important messages from all communication messages. The second is embedded in the critic part, so that all important messages can be processed efficiently. We evaluate DAACMP on three multi-agent tasks with seven different settings. Results show that DAACMP not only outperforms several state-of-the-art methods but also achieves better scalability in all tasks. Furthermore, we conduct experiments to reveal some insights about the proposed attention mechanisms and the learned policies.
Notes
It is a modification of our ACML [17] accepted by AAAI-2020.
It is the same as that of our ATT-MADDPG [18] accepted by AAMAS-2019.
Formally, \(P(s', r_i|\mathbf {o},\mathbf {a},{\varvec{\pi }}) = P(s', r_i|s,a_1,\ldots ,a_N,\pi _1,\ldots ,\pi _N) = P(s', r_i|s,a_1,\ldots ,a_N) = P(s', r_i|s,a_1,\ldots ,a_N,\pi '_1,\ldots ,\pi '_N)\) for any \(\pi _i \ne \pi '_i\). Please refer to MADDPG [33] for details.
The detailed derivation can be found in [56].
The expectation is equivalent to the weighted summation, and the weight of \(Q_{i}^{\pi _i}(s,a_i,\mathbf {a}_{-i})\) is \({\varvec{\pi }}_{-i}(\mathbf {a}_{-i}|s)\) as shown in Eq. (10).
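A minimal numerical sketch of this equivalence, using made-up Q-values and joint-policy probabilities (the four joint actions and their values below are purely illustrative):

```python
import numpy as np

# Hypothetical setup: agent i's Q-value depends on the joint action a_{-i}
# of the other agents; here there are 4 possible joint actions.
q_values = np.array([1.0, 2.0, 3.0, 4.0])    # Q_i(s, a_i, a_{-i}) for each a_{-i}
pi_minus_i = np.array([0.1, 0.2, 0.3, 0.4])  # joint policy probs pi_{-i}(a_{-i}|s)

# The expectation over a_{-i} ...
expected_q = sum(p * q for p, q in zip(pi_minus_i, q_values))
# ... equals the summation weighted by pi_{-i}(a_{-i}|s).
weighted_sum = np.dot(pi_minus_i, q_values)

print(expected_q, weighted_sum)  # both 3.0
```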
This is why we use \(Q_{i}^{k}(s,a_i|\mathbf {a}_{-i};w_i)\) instead of \(Q_{i}^{k}(s,a_i,\mathbf {a}_{-i};w_i)\) to represent the defined action conditional Q-value.
Please note that \(M_i\) is a weighted summation of all other local messages \(m_{j \wedge j \ne i}\), while \(m_j\) is an encoding of \(o_j\). Therefore, \([m_i|M_i]\) contains all the necessary information in \(\langle o_{i}, \mathbf {o}_{-i} \rangle\), which means that the shared representation learning will not lose important information about \(\langle o_{i}, \mathbf {o}_{-i} \rangle\) if the model is well-trained. Moreover, it brings many benefits, such as data efficiency and more robust training.
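A small sketch of this aggregation, assuming three agents with 4-D message encodings; the dot-product attention scores below are a stand-in for whatever learned attention the model uses:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical shapes: N agents, each local message m_j encodes o_j as a 4-D vector.
rng = np.random.default_rng(0)
N, d = 3, 4
m = rng.normal(size=(N, d))  # m_j = encode(o_j), stacked row-wise

i = 0                                             # agent of interest
others = [j for j in range(N) if j != i]
scores = np.array([m[i] @ m[j] for j in others])  # illustrative dot-product scores
alpha = softmax(scores)                           # attention weights, sum to 1

# M_i: weighted summation of all other local messages m_{j != i}
M_i = sum(a * m[j] for a, j in zip(alpha, others))

# [m_i | M_i]: the concatenated representation used for shared learning
rep = np.concatenate([m[i], M_i])
assert rep.shape == (2 * d,)
```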
The detailed advantages of minimizing MLU are discussed in [58].
There are two exceptions. The first is that ACMP-AA underperforms ACMP on the cooperative navigation task when \(N=2\). The other is that ACMP-AA underperforms ACMP on traffic control tasks. As analyzed before, the reason for the former exception is that this setting is too simple to leave room for advanced methods to improve on, while the reason for the latter is that the traffic control task has random biases that go against the property of ACMP-AA.
Recall that the 2D plane is bounded, and positions wrap around at the boundary. The agent’s next position is calculated by \(p_{t+1} = \langle (p_x+v_x)\%10, (p_y+v_y)\%10 \rangle\), where \(\%\) denotes the modulo operation.
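This update can be sketched as follows; the function and argument names are illustrative, not from the paper's code, and the wrap-around relies on Python's modulo returning a non-negative result for a positive modulus:

```python
# Position update on a bounded 10x10 plane with wrap-around (torus) boundaries.
def next_position(pos, vel, bound=10):
    px, py = pos
    vx, vy = vel
    # Python's % keeps results in [0, bound) even for negative coordinates.
    return ((px + vx) % bound, (py + vy) % bound)

# An agent at (9, 5) moving with velocity (2, -6) wraps around both edges:
print(next_position((9, 5), (2, -6)))  # (1, 9)
```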
References
Sutton, R. S., & Barto, A. G. (1998). Introduction to reinforcement learning (Vol. 135). Cambridge: MIT Press.
Tan, M. (1993). Multi-agent reinforcement learning: Independent versus cooperative agents. In Proceedings of the tenth international conference on machine learning (pp. 330–337).
Wu, F., Zilberstein, S., & Chen, X. (2011). Online planning for multi-agent systems with bounded communication. Artificial Intelligence, 175(2), 487–511.
Zhang, C., & Lesser, V. (2013). Coordinating multi-agent reinforcement learning with limited communication. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, international foundation for autonomous agents and multiagent systems (pp. 1101–1108).
Roth, M., Simmons, R., & Veloso, M. (2005). Reasoning about joint beliefs for execution-time communication decisions. In Proceedings of the fourth international joint conference on autonomous agents and multiagent systems, ACM (pp. 786–793).
Roth, M., Simmons, R., & Veloso, M. (2006). What to communicate? Execution-time decision in multi-agent pomdps. In Distributed autonomous robotic systems (Vol. 7, pp. 177–186). Berlin: Springer.
Sukhbaatar, S., Fergus, R., et al. (2016). Learning multiagent communication with backpropagation. In Advances in neural information processing systems (pp. 2244–2252).
Foerster, J., Assael, Y. M., de Freitas, N., & Whiteson, S. (2016). Learning to communicate with deep multi-agent reinforcement learning. In Advances in neural information processing systems (pp 2137–2145).
Peng, P., Yuan, Q., Wen, Y., Yang, Y., Tang, Z., Long, H., & Wang, J. (2017). Multiagent bidirectionally-coordinated nets for learning to play starcraft combat games. arXiv preprint arXiv:1703.10069.
Mao, H., Gong, Z., Ni, Y., & Xiao, Z. (2017). Accnet: Actor-coordinator-critic net for “learning-to-communicate” with deep multi-agent reinforcement learning. arXiv preprint arXiv:1706.03235.
Kong, X., Xin, B., Liu, F., & Wang, Y. (2017). Revisiting the master-slave architecture in multi-agent deep reinforcement learning. arXiv preprint arXiv:1712.07305.
Kilinc, O., & Montana, G. (2019). Multi-agent deep reinforcement learning with extremely noisy observations. In International conference on learning representations.
Kim, D., Moon, S., Hostallero, D., Kang, W. J., Lee, T., Son, K., & Yi, Y. (2019). Learning to schedule communication in multi-agent reinforcement learning. In International conference on learning representations. https://openreview.net/forum?id=SJxu5iR9KQ.
Singh, A., Jain, T., & Sukhbaatar, S. (2019). Individualized controlled continuous communication model for multiagent cooperative and competitive tasks. In International conference on learning representations. https://openreview.net/forum?id=rye7knCqK7.
Kim, W., Cho, M., & Sung, Y. (2019). Message-dropout: An efficient training method for multi-agent deep reinforcement learning. arXiv preprint arXiv:1902.06527.
Mao, H., Gong, Z., Zhang, Z., Xiao, Z., & Ni, Y. (2019). Learning multi-agent communication under limited-bandwidth restriction for internet packet routing. arXiv preprint arXiv:1903.05561.
Mao, H., Zhang, Z., Xiao, Z., Gong, Z., & Ni, Y. (2020). Learning agent communication under limited bandwidth by message pruning. In AAAI 2020.
Mao, H., Zhang, Z., Xiao, Z., & Gong, Z. (2019). Modelling the dynamic joint policy of teammates with attention multi-agent DDPG. In Proceedings of the 18th international joint conference on autonomous agents and multiagent systems, ACM.
Bernstein, D. S., Givan, R., Immerman, N., & Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4), 819–840.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
Konda, V. R., & Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in neural information processing systems (pp. 1008–1014).
Konda, V. R., & Tsitsiklis, J. N. (2003). On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4), 1143–1166.
Grondman, I., Busoniu, L., Lopes, G. A., & Babuska, R. (2012). A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(6), 1291–1307.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic policy gradient algorithms. In ICML.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Mnih, V., Heess, N., Graves, A., et al. (2014). Recurrent models of visual attention. In Advances in neural information processing systems (pp. 2204–2212).
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (pp. 2048–2057).
Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).
Pynadath, D. V., & Tambe, M. (2002). The communicative multiagent team decision problem: Analyzing teamwork theories and models. Journal of Artificial Intelligence Research, 16, 389–423.
Goldman, C. V., & Zilberstein, S. (2004). Decentralized control of cooperative systems: Categorization and complexity analysis. Journal of Artificial Intelligence Research, 22, 143–174.
Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O. P., & Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in neural information processing systems (pp. 6379–6390).
Chu, X., & Ye, H. (2017). Parameter sharing deep deterministic policy gradient for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:1710.00336.
Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., & Whiteson, S. (2017). Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926.
Peng, Z., Zhang, L., & Luo, T. (2018). Learning to communicate via supervised attentional message processing. In Proceedings of the 31st international conference on computer animation and social agents, ACM (pp. 11–16).
Jiang, J., & Lu, Z. (2018). Learning attentional communication for multi-agent cooperation. arXiv preprint arXiv:1805.07733.
Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., & Wang, J. (2018). Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438.
Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., et al. (2017). Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296.
Rashid, T., Samvelyan, M., de Witt, C. S., Farquhar, G., Foerster, J., & Whiteson, S. (2018). Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485.
Lazaridou, A., Peysakhovich, A., & Baroni, M. (2016). Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182.
Mordatch, I., & Abbeel, P. (2017). Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908.
Das, A., Kottur, S., Moura, J. M., Lee, S., & Batra, D. (2017). Learning cooperative visual dialog agents with deep reinforcement learning. arXiv preprint arXiv:1703.06585.
Havrylov, S., & Titov, I. (2017). Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. arXiv preprint arXiv:1705.11192.
Hernandez-Leal, P., Kaisers, M., Baarslag, T., & de Cote, E. M. (2017). A survey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183.
Sorokin, I., Seleznev, A., Pavlov, M., Fedorov, A., & Ignateva, A. (2015). Deep attention recurrent q-network. arXiv preprint arXiv:1512.01693.
Oh, J., Chockalingam, V., Singh, S., & Lee, H. (2016). Control of memory, active perception, and action in minecraft. In Proceedings of the 33rd international conference on machine learning, PMLR (pp. 2790–2799).
Omidshafiei, S., Kim, D. K., Pazis, J., & How, J. P. (2017). Crossmodal attentive skill learner. arXiv preprint arXiv:1711.10314.
Choi, J., Lee, B. J., & Zhang, B. T. (2017). Multi-focus attention network for efficient deep reinforcement learning. In Workshops at the thirty-first AAAI conference on artificial intelligence.
Geng, M., Xu, K., Zhou, X., Ding, B., Wang, H., & Zhang, L. (2019). Learning to cooperate via an attention-based communication neural network in decentralized multi-robot exploration. Entropy, 21(3), 294.
Iqbal, S., & Sha, F. (2018). Actor-attention-critic for multi-agent reinforcement learning. arXiv preprint arXiv:1810.02912.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
Schaul, T., Horgan, D., Gregor, K., & Silver, D. (2015). Universal value function approximators. In International conference on machine learning (pp. 1312–1320).
Albrecht, S. V., & Stone, P. (2018). Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258, 66–95.
He, H., Boyd-Graber, J., Kwok, K., & Daumé III, H. (2016). Opponent modeling in deep reinforcement learning. In International conference on machine learning (pp. 1804–1813).
Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., & De Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd international conference on machine learning, ICML 2016 (pp. 1995–2003).
Kandula, S., Katabi, D., Davie, B., & Charny, A. (2005). Walking the tightrope: Responsive yet stable traffic engineering. ACM SIGCOMM Computer Communication Review, 35, 253–264.
Mataric, M. J. (1994). Reward functions for accelerated learning. In Machine learning proceedings 1994 (pp. 181–189). New York: Elsevier.
Ha, D., & Schmidhuber, J. (2018). World models. arXiv preprint arXiv:1803.10122.
Chockalingam, V., Sung, T. T. K., Behbahani, F., Gargeya, R., Sivanantham, A., & Malysheva, A. (2018). Extending world models for multi-agent reinforcement learning in malmö. In Joint Proceedings of the AIIDE 2018 Workshops co-located with 14th AAAI conference on artificial intelligence and interactive digital entertainment (AIIDE 2018). http://ceur-ws.org/Vol-2282/MARLO_110.pdf.
Andreas, J., Dragan, A., & Klein, D. (2017). Translating neuralese. arXiv preprint arXiv:1704.06960.
Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
Lee, J. B., Rossi, R. A., Kim, S., Ahmed, N. K., & Koh, E. (2018). Attention models in graphs: A survey. arXiv preprint arXiv:1807.07984.
Wang, T., Liao, R., Ba, J., & Fidler, S. (2018). Nervenet: Learning structured policy with graph neural networks. In International conference on learning representations. https://openreview.net/forum?id=S1sqHMZCb.
Jiang, J., Dun, C., & Lu, Z. (2018). Graph convolutional reinforcement learning for multi-agent cooperation. arXiv preprint arXiv:1810.09202.
Acknowledgements
The authors would like to thank the anonymous reviewers for their comments. This work was supported by the National Natural Science Foundation of China under Grant No. 61872397.
Appendices
Appendix 1: The hyperparameters
Appendix 2: The layer merging method
This section introduces the layer-merging method that transforms the multi-dimensional action conditional Q-values into scalar Q-values.
Originally, we want to implement the following equation (here, we take \(K=3\) as an example):

$$\begin{aligned} Q_i = \sum _{k=1}^{3} w_k Q_{ik} \end{aligned}$$ (13)

where \(Q_i\) is the real Q-value used in Bellman equations, \(Q_{ik}\) is the k-th Q-value head, and \(w_k\) is the weight of each Q-value head. Note that in this equation, \(Q_{ik}\) is a scalar. This is the original naive idea.
However, as mentioned in the main paper, in our real implementation, the action conditional Q-values \(Q^{k}_{i}\) (i.e., the Q-value heads) and the contextual Q-value \(Q^{c}_{i}\) are 32D vectors that mimic the scalar Q-value. To generate the real Q-value \(Q_{i}\) used in Bellman equations, we further add a fully-connected layer, which has one output node representing the real Q-value \(Q_{i}\), after the contextual Q-value \(Q^{c}_{i}\).
With the above precondition, we use the variables in the main paper (i.e., in our real implementation) to calculate \(\varvec{w}_{\varvec{k}}\) and \(\varvec{Q}_{\varvec{ik}}\) (and accordingly, \(\varvec{Q}_{\varvec{i}}\)) in Eq. (13) in a suitable way.
Recall that, in the real implementation, \(Q_i\) is generated using:

$$\begin{aligned} Q_i = \sum _{m=1}^{32} l^m {Q^{c}_{i}}^{m} \end{aligned}$$ (14)

where \(Q^{c}_{i}\) is the 32D contextual Q-value, \({Q^{c}_{i}}^{m}\) is the m-th element of \(Q^{c}_{i}\), and \(l^m\) is the m-th network weight linking \({Q^{c}_{i}}^{m}\) and \(Q_i\). Note that \(l^m\) is a scalar, and the last layer of the critic network can be denoted as \(L=[l^1, l^2, \ldots , l^{32}]\).
Recall that, in the real implementation, the contextual Q-value \(Q^{c}_{i}\) is generated using:

$$\begin{aligned} Q^{c}_{i} = \sum _{k=1}^{3} W^k Q^{k}_{i} \end{aligned}$$

where \(Q^{k}_{i}\) is the action conditional Q-values (each of which is a 32D vector), and \(W=\langle W^1, W^2, W^3 \rangle\) is the learned attention weight where \(W^1 + W^2 + W^3 = 1\). Note that \(W^k\) is a scalar. Accordingly, the m-th element of \(Q^{c}_{i}\) is generated using:

$$\begin{aligned} {Q^{c}_{i}}^{m} = \sum _{k=1}^{3} W^k {Q^{k}_{i}}^{m} \end{aligned}$$
Then we can rewrite Eq. (14) as:

$$\begin{aligned} Q_i = \sum _{m=1}^{32} l^m \sum _{k=1}^{3} W^k {Q^{k}_{i}}^{m} = \sum _{k=1}^{3} W^k \left( \sum _{m=1}^{32} l^m {Q^{k}_{i}}^{m} \right) \end{aligned}$$ (18)

Comparing Eq. (18) with Eq. (13), we can calculate \(w_k\) and \(Q_{ik}\) using:

$$\begin{aligned} w_k = W^k, \qquad Q_{ik} = \sum _{m=1}^{32} l^m {Q^{k}_{i}}^{m} = L \cdot Q^{k}_{i} \end{aligned}$$
As can be seen from the above equations, this method directly connects the action conditional Q-values \(Q^k_i\) with the last layer \(L\) of the critic network to transform the multi-dimensional Q-values into scalar Q-values. It can therefore be seen as a layer merging method.
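The layer-merging equivalence can be checked numerically. The sketch below uses random stand-ins for the learned quantities (the head values, attention weights, and last-layer weights are all made up), and verifies that folding \(L\) into each head before the attention-weighted sum gives the same scalar Q-value as the original two-step computation:

```python
import numpy as np

rng = np.random.default_rng(1)
K, M = 3, 32                 # K Q-value heads, each a 32-D vector

Qk = rng.normal(size=(K, M)) # action conditional Q-values Q_i^k (illustrative)
W = np.array([0.2, 0.3, 0.5])# attention weights W^k, sum to 1 (illustrative)
L = rng.normal(size=M)       # last-layer weights [l^1, ..., l^32] (illustrative)

# Original two-step computation: contextual Q-value, then scalar Q-value.
Qc = W @ Qk                  # Q_i^c = sum_k W^k * Q_i^k  (32-D vector)
Q_i = L @ Qc                 # Q_i   = sum_m l^m * (Q_i^c)^m  (scalar)

# Layer merging: fold L into each head first, i.e. Q_ik = L . Q_i^k.
Qik = Qk @ L                 # K scalar head values
Q_i_merged = W @ Qik         # Q_i = sum_k w_k * Q_ik with w_k = W^k

assert np.isclose(Q_i, Q_i_merged)
```

The equivalence holds because the final fully-connected layer is linear, so it distributes over the attention-weighted sum of the heads.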
Cite this article
Mao, H., Zhang, Z., Xiao, Z. et al. Learning multi-agent communication with double attentional deep reinforcement learning. Auton Agent Multi-Agent Syst 34, 32 (2020). https://doi.org/10.1007/s10458-020-09455-w