Efficient policy detecting and reusing for non-stationarity in Markov games

Abstract

One challenging problem in multiagent systems is to cooperate or compete with non-stationary agents whose behavior changes over time. An agent in such a non-stationary environment should be able to quickly detect the other agents' current policy during online interaction and then adapt its own policy accordingly. This article studies efficient policy detection and reuse techniques for playing against non-stationary agents in cooperative or competitive Markov games. We propose a new deep Bayesian policy reuse algorithm, called DPN-BPR+, which extends the recent BPR+ algorithm with a neural network as the value-function approximator. For accurate policy detection, we propose a rectified belief model that exploits the opponent model to infer the other agents' policies from both reward signals and their behavior. Instead of directly storing individual policies as BPR+ does, we introduce a distilled policy network that serves as the policy library and use policy distillation to achieve efficient online policy learning and reuse. DPN-BPR+ inherits all the advantages of BPR+. In experiments, we evaluate DPN-BPR+ in terms of detection accuracy, cumulative reward and convergence speed in four complex Markov games with raw visual inputs, including two cooperative games and two competitive games. Empirical results show that our proposed DPN-BPR+ approach outperforms existing algorithms in all these Markov games.
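
DPN-BPR+ builds on Bayesian policy reuse, which maintains a belief over which opponent policy is currently active, updates that belief from observed signals, and then reuses the library policy expected to perform best. The following Python sketch illustrates only this generic detect-and-reuse loop; it is not the paper's implementation, and the Gaussian performance models, the function names and the greedy policy selection are assumptions made for illustration (the paper's rectified belief additionally exploits an opponent model over raw observations).

```python
import numpy as np

# Hypothetical sketch of a Bayesian-policy-reuse style detect-and-reuse loop.
# perf_models[i][j] = (mean, std) of the return assumed to be obtained when
# library policy j is reused against opponent policy tau_i.

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def update_belief(belief, chosen_policy, observed_return, perf_models):
    """One Bayesian update of the belief over opponent policies from an episode return."""
    likelihood = np.array([
        gaussian_pdf(observed_return, *perf_models[i][chosen_policy])
        for i in range(len(belief))
    ])
    posterior = belief * likelihood
    return posterior / posterior.sum()

def select_policy(belief, perf_models):
    """Reuse the library policy with the highest expected return under the current belief."""
    n_policies = len(perf_models[0])
    expected = [
        sum(belief[i] * perf_models[i][j][0] for i in range(len(belief)))
        for j in range(n_policies)
    ]
    return int(np.argmax(expected))

# Toy example: two known opponent policies, two library policies.
perf_models = [[(10.0, 2.0), (2.0, 2.0)],   # returns against opponent policy tau_0
               [(1.0, 2.0), (9.0, 2.0)]]    # returns against opponent policy tau_1
belief = np.array([0.5, 0.5])
policy = select_policy(belief, perf_models)
belief = update_belief(belief, policy, observed_return=1.5, perf_models=perf_models)
```

In DPN-BPR+, the tabular performance models and individually stored policies of this sketch are replaced by a neural value-function approximator and a single distilled policy network, and the belief update is rectified with the opponent model's predictions.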

Notes

  1. Strictly speaking, \(p(\tau )\) is a probability only after normalization, which is performed jointly with the vanilla belief model \(\beta (\tau )\) (see \(\eta \) in Eq. (13)); a generic illustration of such joint normalization is sketched below.
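
As a purely generic illustration (not a reconstruction of Eq. (13), which is defined in the full text), jointly normalizing the opponent-model term with the vanilla belief could look as follows; combining the two terms by an element-wise product is an assumption of this sketch.

```python
import numpy as np

# Hypothetical sketch: jointly normalise the unnormalised opponent-model scores
# p(tau) and the vanilla belief beta(tau) with a single constant eta.
# The element-wise product used to combine them is an assumption, not Eq. (13).
def jointly_normalized_belief(p_scores, beta):
    combined = np.asarray(p_scores, dtype=float) * np.asarray(beta, dtype=float)
    eta = combined.sum()              # shared normalising constant
    return combined / eta
```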

References

  1. Albrecht, S. V., & Stone, P. (2018). Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258, 66–95.

  2. Banerjee, T., Liu, M., & How, J. P. (2017). Quickest change detection approach to optimal control in Markov decision processes with model changes. In 2017 American control conference (ACC) (pp. 399–405).

  3. Brafman, R. I., & Tennenholtz, M. (2003). R-max—A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3, 213–231.

  4. Chalkiadakis, G., & Boutilier, C. (2003). Coordination in multiagent reinforcement learning: A Bayesian approach. In Proceedings of the 2nd international conference on autonomous agents and multiagent systems (AAMAS) (pp. 709–716).

  5. Crandall, J. W. (2012). Just add pepper: Extending learning algorithms for repeated matrix games to repeated Markov games. In Proceedings of the 11th international conference on autonomous agents and multiagent systems (AAMAS) (pp. 399–406).

  6. da Silva, B. C., Basso, E. W., Bazzan, A. L. C., & Engel, P. M. (2006). Dealing with non-stationary environments using context detection. In Proceedings of the 23rd international conference on machine learning (ICML) (pp. 217–224).

  7. de Weerd, H., Verbrugge, R., & Verheij, B. (2013). Higher-order theory of mind in negotiations under incomplete information. In Proceedings of the 16th international conference on principles and practice of multi-agent systems (PRIMA) (pp. 101–116).

  8. Foerster, J. N., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., & Mordatch, I. (2018). Learning with opponent-learning awareness. In Proceedings of the 17th international conference on autonomous agents and multiagent systems (AAMAS) (pp. 122–130).

  9. Gupta, J. K., Egorov, M., & Kochenderfer, M. J. (2017). Cooperative multi-agent control using deep reinforcement learning. In Adaptive learning agents workshop.

  10. Hadoux, E., Beynier, A., & Weng, P. (2014). Sequential decision-making under non-stationary environments via sequential change-point detection. In Learning over multiple contexts (LMCE).

  11. He, H., & Boyd-Graber, J. L. (2016). Opponent modeling in deep reinforcement learning. In Proceedings of the 33rd international conference on machine learning (ICML) (pp. 1804–1813).

  12. Hernandez-Leal, P., de Cote, E. M., & Sucar, L. E. (2014). A framework for learning and planning against switching strategies in repeated games. Adaptive and Learning Agents, 26(2), 103–122.

  13. Hernandez-Leal, P., & Kaisers, M. (2017). Learning against sequential opponents in repeated stochastic games. In The 3rd multi-disciplinary conference on reinforcement learning and decision making.

  14. Hernandez-Leal, P., & Kaisers, M. (2017). Towards a fast detection of opponents in repeated stochastic games. In Proceedings of the 16th international conference on autonomous agents and multiagent systems (AAMAS) (pp. 239–257).

  15. Hernandez-Leal, P., Kaisers, M., Baarslag, T., & de Cote, E. M. (2017). A survey of learning in multiagent environments: Dealing with non-stationarity. CoRR. arXiv:1707.09183.

  16. Hernandez-Leal, P., Rosman, B., Taylor, M. E., Sucar, L. E., & de Cote, E. M. (2016). A Bayesian approach for learning and tracking switching, non-stationary opponents (extended abstract). In Proceedings of the 15th international conference on autonomous agents and multiagent systems (AAMAS) (pp. 1315–1316).

  17. Hernandez-Leal, P., Zhan, Y., Taylor, M. E., Sucar, L. E., & de Cote, E. M. (2017). Efficiently detecting switches against non-stationary opponents. Autonomous Agents and Multi-Agent Systems, 31(4), 767–789.

  18. Hinton, G. E., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. CoRR. arXiv:1503.02531.

  19. Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. Advances in Neural Information Processing Systems, 29, 4565–4573.

  20. Hong, Z., Su, S., Shann, T., Chang, Y., & Lee, C. (2018). A deep policy inference q-network for multi-agent systems. In Proceedings of the 17th international conference on autonomous agents and multiagent systems (AAMAS) (pp. 1388–1396).

  21. Hu, J., & Wellman, M. P. (1998). Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the 15th international conference on machine learning (ICML) (pp. 242–250).

  22. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. In International conference on learning representations (ICLR).

  23. Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th international conference on machine learning (ICML) (pp. 157–163).

  24. Lopes, M., Lang, T., Toussaint, M., & Yves Oudeyer, P. (2012). Exploration in model-based reinforcement learning by empirically estimating learning progress. Advances in Neural Information Processing Systems, 25, 206–214.

  25. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., & Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems, 30, 6382–6393.

  26. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd international conference on machine learning (ICML) (pp. 1928–1937).

  27. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.

  28. Palmer, G., Tuyls, K., Bloembergen, D., & Savani, R. (2018). Lenient multi-agent deep reinforcement learning. In Proceedings of the 17th international conference on autonomous agents and multiagent systems (AAMAS) (pp. 443–451).

  29. Rosman, B., Hawasly, M., & Ramamoorthy, S. (2016). Bayesian policy reuse. Machine Learning, 104(1), 99–127.

  30. Rusu, A. A., Colmenarejo, S. G., Gülçehre, Ç., Desjardins, G., Kirkpatrick, J., Pascanu, R., Mnih, V., Kavukcuoglu, K., & Hadsell, R. (2015). Policy distillation. CoRR. arXiv:1511.06295.

  31. Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized experience replay. In International conference on learning representations (ICLR).

  32. Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., et al. (2017). Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE, 12(4), 1–15. https://doi.org/10.1371/journal.pone.0172395.

  33. Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., et al. (2017). StarCraft II: A new challenge for reinforcement learning. arXiv:1708.04782.

  34. van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of the 30th AAAI conference on artificial intelligence (AAAI) (pp. 2094–2100).

  35. von der Osten, F. B., Kirley, M., & Miller, T. (2017). The minds of many: Opponent modeling in a stochastic game. In Proceedings of the 26th international joint conference on artificial intelligence (IJCAI) (pp. 3845–3851).

  36. Wang, Z., Merel, J. S., Reed, S. E., de Freitas, N., Wayne, G., & Heess, N. (2017). Robust imitation of diverse behaviors. Advances in Neural Information Processing Systems, 30, 5320–5329.

  37. Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., & de Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd international conference on machine learning (ICML) (pp. 1995–2003).

  38. Widmer, G., & Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1), 69–101.

  39. Yang, T., Hao, J., Meng, Z., Zheng, Y., Zhang, C., & Zheng, Z. (2019). Bayes-tomop: A fast detection and best response algorithm towards sophisticated opponents. In Proceedings of the 18th international conference on autonomous agents and multiagent systems (AAMAS) (pp. 2282–2284). International Foundation for Autonomous Agents and Multiagent Systems.

  40. Zhao, X., Zhang, L., Ding, Z., Yin, D., Zhao, Y., & Tang, J. (2018). Deep reinforcement learning for list-wise recommendations. CoRR. arXiv:1801.00209.

  41. Zheng, Y., Meng, Z., Hao, J., Zhang, Z., Yang, T., & Fan, C. (2018). A deep Bayesian policy reuse approach against non-stationary agents. Advances in Neural Information Processing Systems, 31, 954–964.

Acknowledgements

The work is supported by the National Natural Science Foundation of China (Grant Nos.: 61702362, 61876119), Special Program of Artificial Intelligence, Tianjin Research Program of Application Foundation and Advanced Technology (No.: 16JCQNJC00100), Special Program of Artificial Intelligence of Tianjin Municipal Science and Technology Commission (No.: 17ZXRGGX00150), Science and Technology Program of Tianjin, China (Grant Nos. 15PTCYSY00030, 16ZXHLGX00170), and Natural Science Foundation of Jiangsu (No.: BK20181432). We thank Rui Kong and Weijian Liao from Nanjing University for their insightful comments that improved the quality of this paper. We also thank our industrial research partner NetEase, Inc., especially the Fuxi AI Lab, for their support in providing the environments.

Author information

Corresponding author

Correspondence to Jianye Hao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is an extended version of the paper [41] presented at the 32nd Conference on Neural Information Processing Systems (NeurIPS), Montreal, Canada, 2018.

About this article

Cite this article

Zheng, Y., Hao, J., Zhang, Z. et al. Efficient policy detecting and reusing for non-stationarity in Markov games. Auton Agent Multi-Agent Syst 35, 2 (2021). https://doi.org/10.1007/s10458-020-09480-9
