Efficient policy detecting and reusing for non-stationarity in Markov games
Autonomous Agents and Multi-Agent Systems (IF 1.9), Pub Date: 2020-10-26, DOI: 10.1007/s10458-020-09480-9
Yan Zheng, Jianye Hao, Zongzhang Zhang, Zhaopeng Meng, Tianpei Yang, Yanran Li, Changjie Fan

One challenging problem in multiagent systems is to cooperate or compete with non-stationary agents that change their behavior from time to time. An agent in such a non-stationary environment is usually expected to quickly detect the other agents' policies during online interaction and then adapt its own policy accordingly. This article studies efficient policy detection and reuse techniques for playing against non-stationary agents in cooperative or competitive Markov games. We propose a new deep Bayesian policy reuse algorithm, DPN-BPR+, which extends the recent BPR+ algorithm with a neural network as the value-function approximator. To detect policies accurately, we propose a rectified belief model that leverages the opponent model to infer the other agents' policies from both reward signals and their behavior. Instead of directly storing individual policies as BPR+ does, we introduce a distilled policy network that serves as the policy library, together with policy distillation, to achieve efficient online policy learning and reuse. DPN-BPR+ inherits all the advantages of BPR+. In experiments, we evaluate DPN-BPR+ in terms of detection accuracy, cumulative reward, and speed of convergence in four complex Markov games with raw visual inputs, including two cooperative games and two competitive games. Empirical results show that our proposed DPN-BPR+ approach outperforms existing algorithms in all these Markov games.
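For context, the following is a minimal sketch of the standard Bayesian Policy Reuse belief update and policy-selection step that BPR+ (and hence DPN-BPR+) builds on, assuming the conventional BPR formulation; the helper names (`performance_model`, `utility`, `library`) are illustrative placeholders, not identifiers from the paper.

```python
from typing import Callable, Dict, List

def update_belief(
    belief: Dict[str, float],          # prior P(tau) over opponent policies tau
    reward: float,                     # reward signal observed this episode
    executed: str,                     # policy pi we executed against the opponent
    performance_model: Callable[[str, str, float], float],  # P(reward | tau, pi)
) -> Dict[str, float]:
    """Bayes rule: P(tau | r) is proportional to P(r | tau, pi) * P(tau)."""
    posterior = {tau: performance_model(tau, executed, reward) * p
                 for tau, p in belief.items()}
    z = sum(posterior.values()) or 1.0  # normalize; guard against zero mass
    return {tau: p / z for tau, p in posterior.items()}

def select_policy(
    belief: Dict[str, float],
    library: List[str],                # reusable policies (a distilled network in DPN-BPR+)
    utility: Callable[[str, str], float],  # estimated return of pi against tau
) -> str:
    """Reuse the library policy with the highest expected return under the belief."""
    return max(library, key=lambda pi: sum(b * utility(pi, tau)
                                           for tau, b in belief.items()))
```

DPN-BPR+ departs from this plain scheme in two ways described in the abstract: the belief update is rectified with an opponent model so detection uses behavior as well as reward signals, and the library of individual policies is replaced by a single distilled policy network.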




Updated: 2020-10-30