Dealing with Non-Stationarity in Multi-Agent Reinforcement Learning via Trust Region Decomposition
arXiv - CS - Computer Science and Game Theory Pub Date : 2021-02-21 , DOI: arxiv-2102.10616
Wenhao Li, Xiangfeng Wang, Bo Jin, Junjie Sheng, Hongyuan Zha

Non-stationarity is a thorny issue in multi-agent reinforcement learning, caused by the changing policies of agents during the learning process. Existing approaches to this problem, such as centralized critic with decentralized actor (CCDA), population-based self-play, and modeling of other agents, have limitations in effectiveness and scalability. In this paper, we introduce a $\delta$-stationarity measurement to explicitly model the stationarity of a policy sequence, which we prove to be proportional to the joint policy divergence. However, a simple policy factorization such as the mean-field approximation can lead to a larger policy divergence, a problem we call the trust region decomposition dilemma. We model the joint policy as a general Markov random field and propose a message-passing trust region decomposition network that estimates the joint policy divergence more accurately. We establish the Multi-Agent Mirror descent policy algorithm with Trust region decomposition (MAMT) to satisfy $\delta$-stationarity. MAMT adaptively adjusts the trust regions of the local policies in an end-to-end manner, approximately constraining the joint policy divergence and thereby alleviating the non-stationarity problem. Our method brings noticeable and stable performance improvements over baselines on coordination tasks of varying complexity.
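To make the factorization idea behind the abstract concrete, the following is a minimal NumPy sketch (not the authors' MAMT implementation; all policy distributions are hypothetical toy numbers). It illustrates the mean-field case mentioned above: when the joint policy is a product of independent local policies, the joint KL divergence between consecutive policy iterates decomposes exactly into a sum of per-agent KL divergences, and a $\delta$-stationarity check reduces to bounding that joint divergence by a threshold $\delta$.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete probability distributions."""
    return float(np.sum(p * np.log(p / q)))

# Toy local policies for two agents, before and after one update step
# (hypothetical numbers chosen purely for illustration).
old = [np.array([0.6, 0.4]), np.array([0.5, 0.5])]
new = [np.array([0.5, 0.5]), np.array([0.4, 0.6])]

# Under a mean-field factorization the joint policy is the product of
# local policies, so the joint KL decomposes into a sum of local KLs:
#   KL(prod_i p_i || prod_i q_i) = sum_i KL(p_i || q_i)
local_sum = sum(kl(p, q) for p, q in zip(old, new))

# Explicit joint distributions over the product action space, as a check.
joint_old = np.outer(old[0], old[1]).ravel()
joint_new = np.outer(new[0], new[1]).ravel()
joint_kl = kl(joint_old, joint_new)

assert np.isclose(local_sum, joint_kl)

# delta-stationarity (in the sense sketched by the abstract): the update
# is acceptable if the joint policy divergence stays below a threshold.
delta = 0.1
print(f"joint KL = {joint_kl:.4f}, delta-stationary: {joint_kl <= delta}")
```

The point of the trust region decomposition dilemma is that this clean additive decomposition only holds when the local policies really are independent; for a correlated joint policy (the Markov random field case in the paper), summing local divergences can misestimate the joint divergence, which is what the proposed message-passing network is meant to correct.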

Updated: 2021-02-23