Efficient policy detecting and reusing for non-stationarity in Markov games

Abstract

One challenging problem in multiagent systems is to cooperate or compete with non-stationary agents whose behavior changes over time. An agent in such a non-stationary environment should be able to quickly detect the other agents' current policy during online interaction and then adapt its own policy accordingly. This article studies efficient policy detection and reuse techniques for playing against non-stationary agents in cooperative or competitive Markov games. We propose a new deep Bayesian policy reuse algorithm, called DPN-BPR+, which extends the recent BPR+ algorithm with a neural network as the value-function approximator. For accurate policy detection, we propose a rectified belief model that exploits the opponent model to infer the other agents' policies from both reward signals and their behavior. Instead of directly storing individual policies as BPR+ does, we introduce a distilled policy network that serves as the policy library and use policy distillation to achieve efficient online policy learning and reuse. DPN-BPR+ inherits all the advantages of BPR+. In experiments, we evaluate DPN-BPR+ in terms of detection accuracy, cumulative reward and convergence speed in four complex Markov games with raw visual inputs, including two cooperative games and two competitive games. Empirical results show that our proposed DPN-BPR+ approach outperforms existing algorithms in all these Markov games.
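
DPN-BPR+ builds on Bayesian policy reuse, which maintains a belief over which opponent policy is currently active, updates that belief from observed signals, and then reuses the library policy expected to perform best. The following Python sketch illustrates only this generic detect-and-reuse loop; it is not the paper's implementation, and the Gaussian performance models, the function names and the greedy policy selection are assumptions made for illustration (the paper's rectified belief additionally exploits an opponent model over raw observations).

```python
import numpy as np

# Hypothetical sketch of a Bayesian-policy-reuse style detect-and-reuse loop.
# perf_models[i][j] = (mean, std) of the return assumed to be obtained when
# library policy j is reused against opponent policy tau_i.

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def update_belief(belief, chosen_policy, observed_return, perf_models):
    """One Bayesian update of the belief over opponent policies from an episode return."""
    likelihood = np.array([
        gaussian_pdf(observed_return, *perf_models[i][chosen_policy])
        for i in range(len(belief))
    ])
    posterior = belief * likelihood
    return posterior / posterior.sum()

def select_policy(belief, perf_models):
    """Reuse the library policy with the highest expected return under the current belief."""
    n_policies = len(perf_models[0])
    expected = [
        sum(belief[i] * perf_models[i][j][0] for i in range(len(belief)))
        for j in range(n_policies)
    ]
    return int(np.argmax(expected))

# Toy example: two known opponent policies, two library policies.
perf_models = [[(10.0, 2.0), (2.0, 2.0)],   # returns against opponent policy tau_0
               [(1.0, 2.0), (9.0, 2.0)]]    # returns against opponent policy tau_1
belief = np.array([0.5, 0.5])
policy = select_policy(belief, perf_models)
belief = update_belief(belief, policy, observed_return=1.5, perf_models=perf_models)
```

In DPN-BPR+, the tabular performance models and individually stored policies of this sketch are replaced by a neural value-function approximator and a single distilled policy network, and the belief update is rectified with the opponent model's predictions.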

Notes

  1. Strictly speaking, \(p(\tau )\) is a probability only after normalization, which is performed jointly with the vanilla belief model \(\beta (\tau )\) (see \(\eta \) in Eq. (13)); a generic illustration of such joint normalization is sketched below.
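
As a purely generic illustration (not a reconstruction of Eq. (13), which is defined in the full text), jointly normalizing the opponent-model term with the vanilla belief could look as follows; combining the two terms by an element-wise product is an assumption of this sketch.

```python
import numpy as np

# Hypothetical sketch: jointly normalise the unnormalised opponent-model scores
# p(tau) and the vanilla belief beta(tau) with a single constant eta.
# The element-wise product used to combine them is an assumption, not Eq. (13).
def jointly_normalized_belief(p_scores, beta):
    combined = np.asarray(p_scores, dtype=float) * np.asarray(beta, dtype=float)
    eta = combined.sum()              # shared normalising constant
    return combined / eta
```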

References

  1. Albrecht, S. V., & Stone, P. (2018). Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258, 66–95.

  2. Banerjee, T., Liu, M., & How, J. P. (2017). Quickest change detection approach to optimal control in Markov decision processes with model changes. In 2017 American control conference (ACC) (pp. 399–405).

  3. Brafman, R. I., & Tennenholtz, M. (2003). R-max—A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3, 213–231.

  4. Chalkiadakis, G., & Boutilier, C. (2003). Coordination in multiagent reinforcement learning: A Bayesian approach. In Proceedings of the 2nd international conference on autonomous agents and multiagent systems (AAMAS) (pp. 709–716).

  5. Crandall, J. W. (2012). Just add pepper: Extending learning algorithms for repeated matrix games to repeated Markov games. In Proceedings of the 11th international conference on autonomous agents and multiagent systems (AAMAS) (pp. 399–406).

  6. da Silva, B. C., Basso, E. W., Bazzan, A. L. C., & Engel, P. M. (2006). Dealing with non-stationary environments using context detection. In Proceedings of the 23rd international conference on machine learning (ICML) (pp. 217–224).

  7. de Weerd, H., Verbrugge, R., & Verheij, B. (2013). Higher-order theory of mind in negotiations under incomplete information. In Proceedings of the 16th international conference on principles and practice of multi-agent systems (PRIMA) (pp. 101–116).

  8. Foerster, J. N., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., & Mordatch, I. (2018). Learning with opponent-learning awareness. In Proceedings of the 17th international conference on autonomous agents and multiagent systems (AAMAS) (pp. 122–130).

  9. Gupta, J. K., Egorov, M., & Kochenderfer, M. J. (2017). Cooperative multi-agent control using deep reinforcement learning. In Adaptive learning agents workshop.

  10. Hadoux, E., Beynier, A., & Weng, P. (2014). Sequential decision-making under non-stationary environments via sequential change-point detection. In Learning over multiple contexts (LMCE).

  11. He, H., & Boyd-Graber, J. L. (2016). Opponent modeling in deep reinforcement learning. In Proceedings of the 33rd international conference on machine learning (ICML) (pp. 1804–1813).

  12. Hernandez-Leal, P., de Cote, E. M., & Sucar, L. E. (2014). A framework for learning and planning against switching strategies in repeated games. Adaptive and Learning Agents, 26(2), 103–122.

  13. Hernandez-Leal, P., & Kaisers, M. (2017). Learning against sequential opponents in repeated stochastic games. In The 3rd multi-disciplinary conference on reinforcement learning and decision making.

  14. Hernandez-Leal, P., & Kaisers, M. (2017). Towards a fast detection of opponents in repeated stochastic games. In Proceedings of the 16th international conference on autonomous agents and multiagent systems (AAMAS) (pp. 239–257).

  15. Hernandez-Leal, P., Kaisers, M., Baarslag, T., & de Cote, E. M. (2017). A survey of learning in multiagent environments: Dealing with non-stationarity. CoRR. arXiv:1707.09183.

  16. Hernandez-Leal, P., Rosman, B., Taylor, M. E., Sucar, L. E., & de Cote, E. M. (2016). A Bayesian approach for learning and tracking switching, non-stationary opponents (extended abstract). In Proceedings of the 15th international conference on autonomous agents and multiagent systems (AAMAS) (pp. 1315–1316).

  17. Hernandez-Leal, P., Zhan, Y., Taylor, M. E., Sucar, L. E., & de Cote, E. M. (2017). Efficiently detecting switches against non-stationary opponents. Autonomous Agents and Multi-Agent Systems, 31(4), 767–789.

  18. Hinton, G. E., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. CoRR. arXiv:1503.02531.

  19. Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. Advances in Neural Information Processing Systems, 29, 4565–4573.

  20. Hong, Z., Su, S., Shann, T., Chang, Y., & Lee, C. (2018). A deep policy inference q-network for multi-agent systems. In Proceedings of the 17th international conference on autonomous agents and multiagent systems (AAMAS) (pp. 1388–1396).

  21. Hu, J., & Wellman, M. P. (1998). Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the 15th international conference on machine learning (ICML) (pp. 242–250).

  22. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. In International conference on learning representations (ICLR).

  23. Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th international conference on machine learning (ICML) (pp. 157–163).

  24. Lopes, M., Lang, T., Toussaint, M., & Yves Oudeyer, P. (2012). Exploration in model-based reinforcement learning by empirically estimating learning progress. Advances in Neural Information Processing Systems, 25, 206–214.

  25. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., & Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems, 30, 6382–6393.

  26. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd international conference on machine learning (ICML) (pp. 1928–1937).

  27. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.

  28. Palmer, G., Tuyls, K., Bloembergen, D., & Savani, R. (2018). Lenient multi-agent deep reinforcement learning. In Proceedings of the 17th international conference on autonomous agents and multiagent systems (AAMAS) (pp. 443–451).

  29. Rosman, B., Hawasly, M., & Ramamoorthy, S. (2016). Bayesian policy reuse. Machine Learning, 104(1), 99–127.

  30. Rusu, A. A., Colmenarejo, S. G., Gülçehre, Ç., Desjardins, G., Kirkpatrick, J., Pascanu, R., Mnih, V., Kavukcuoglu, K., & Hadsell, R. (2015). Policy distillation. CoRR. arXiv:1511.06295.

  31. Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized experience replay. In International conference on learning representations (ICLR).

  32. Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., et al. (2017). Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE, 12(4), 1–15. https://doi.org/10.1371/journal.pone.0172395.

  33. Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., et al. (2017). StarCraft II: A new challenge for reinforcement learning. arXiv:1708.04782.

  34. van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of the 30th AAAI conference on artificial intelligence (AAAI) (pp. 2094–2100).

  35. von der Osten, F. B., Kirley, M., & Miller, T. (2017). The minds of many: Opponent modeling in a stochastic game. In Proceedings of the 26th international joint conference on artificial intelligence (IJCAI) (pp. 3845–3851).

  36. Wang, Z., Merel, J. S., Reed, S. E., de Freitas, N., Wayne, G., & Heess, N. (2017). Robust imitation of diverse behaviors. Advances in Neural Information Processing Systems, 30, 5320–5329.

  37. Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., & de Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd international conference on machine learning (ICML) (pp. 1995–2003).

  38. Widmer, G., & Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1), 69–101.

  39. Yang, T., Hao, J., Meng, Z., Zheng, Y., Zhang, C., & Zheng, Z. (2019). Bayes-tomop: A fast detection and best response algorithm towards sophisticated opponents. In Proceedings of the 18th international conference on autonomous agents and multiagent systems (AAMAS) (pp. 2282–2284). International Foundation for Autonomous Agents and Multiagent Systems.

  40. Zhao, X., Zhang, L., Ding, Z., Yin, D., Zhao, Y., & Tang, J. (2018). Deep reinforcement learning for list-wise recommendations. CoRR. arXiv:1801.00209.

  41. Zheng, Y., Meng, Z., Hao, J., Zhang, Z., Yang, T., & Fan, C. (2018). A deep Bayesian policy reuse approach against non-stationary agents. Advances in Neural Information Processing Systems, 31, 954–964.

Acknowledgements

The work is supported by the National Natural Science Foundation of China (Grant Nos.: 61702362, 61876119), Special Program of Artificial Intelligence, Tianjin Research Program of Application Foundation and Advanced Technology (No.: 16JCQNJC00100), Special Program of Artificial Intelligence of Tianjin Municipal Science and Technology Commission (No.: 17ZXRGGX00150), Science and Technology Program of Tianjin, China (Grant Nos. 15PTCYSY00030, 16ZXHLGX00170), and Natural Science Foundation of Jiangsu (No.: BK20181432). We thank Rui Kong and Weijian Liao from Nanjing University for their insightful comments that improved the quality of this paper. We also thank our industrial research partner NetEase, Inc., especially the Fuxi AI Lab, for their support in providing the environments.

Author information

Corresponding author

Correspondence to Jianye Hao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is an extended version of the paper [41] presented at the 32nd Conference on Neural Information Processing Systems (NeurIPS), Montreal, Canada, 2018.

About this article

Cite this article

Zheng, Y., Hao, J., Zhang, Z. et al. Efficient policy detecting and reusing for non-stationarity in Markov games. Auton Agent Multi-Agent Syst 35, 2 (2021). https://doi.org/10.1007/s10458-020-09480-9
