Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning
arXiv - CS - Multiagent Systems. Pub Date: 2020-03-19. arXiv:2003.08839. Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, Shimon Whiteson
In many real-world settings, a team of agents must coordinate its behaviour while acting in a decentralised fashion. At the same time, it is often possible to train the agents in a centralised fashion where global state information is available and communication constraints are lifted. Learning joint action-values conditioned on extra state information is an attractive way to exploit centralised learning, but the best strategy for then extracting decentralised policies is unclear. Our solution is QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion. QMIX employs a mixing network that estimates joint action-values as a monotonic combination of per-agent values. We structurally enforce that the joint action-value is monotonic in the per-agent values through the use of non-negative weights in the mixing network, which guarantees consistency between the centralised and decentralised policies. To evaluate the performance of QMIX, we propose the StarCraft Multi-Agent Challenge (SMAC) as a new benchmark for deep multi-agent reinforcement learning. We evaluate QMIX on a challenging set of SMAC scenarios and show that it significantly outperforms existing multi-agent reinforcement learning methods.
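
To make the monotonicity constraint concrete: QMIX requires ∂Q_tot/∂Q_a ≥ 0 for every agent a, and it enforces this by constraining the weights of the mixing network, which are generated by hypernetworks conditioned on the global state, to be non-negative. Below is a minimal PyTorch sketch of such a mixing network; the embedding size, the ELU nonlinearity, and names such as MixingNetwork and hyper_w1 are illustrative assumptions, not the authors' reference implementation.

```python
# A minimal sketch of a QMIX-style mixing network in PyTorch (assumptions:
# layer sizes, ELU nonlinearity, and all names are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixingNetwork(nn.Module):
    """Mixes per-agent Q-values into Q_tot, monotonic in each input.

    Hypernetworks conditioned on the global state produce the mixing
    weights; taking absolute values keeps the weights non-negative,
    which enforces dQ_tot/dQ_a >= 0 for every agent a.
    """

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # State-conditioned hypernetworks for the two mixing layers.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
        )

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        bs = agent_qs.size(0)
        # Absolute value keeps mixing weights non-negative (monotonicity).
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2  # (batch, 1, 1)
        return q_tot.view(bs, 1)


# Usage sketch: 3 agents, a 48-dimensional global state, batch of 8.
mixer = MixingNetwork(n_agents=3, state_dim=48)
q_tot = mixer(torch.randn(8, 3), torch.randn(8, 48))
print(q_tot.shape)  # torch.Size([8, 1])
```

Because the mixing weights are non-negative, each agent greedily maximising its own Q-value yields the same joint action as maximising Q_tot, which is what allows the centrally trained policies to be executed in a fully decentralised fashion.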
Updated: 2020-09-22