Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning
arXiv - CS - Multiagent Systems. Pub Date: 2020-03-19. arXiv:2003.08839. Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, Shimon Whiteson
In many real-world settings, a team of agents must coordinate its behaviour while acting in a decentralised fashion. At the same time, it is often possible to train the agents in a centralised fashion where global state information is available and communication constraints are lifted. Learning joint action-values conditioned on extra state information is an attractive way to exploit centralised learning, but the best strategy for then extracting decentralised policies is unclear. Our solution is QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion. QMIX employs a mixing network that estimates joint action-values as a monotonic combination of per-agent values. We structurally enforce that the joint action-value is monotonic in the per-agent values through the use of non-negative weights in the mixing network, which guarantees consistency between the centralised and decentralised policies. To evaluate the performance of QMIX, we propose the StarCraft Multi-Agent Challenge (SMAC) as a new benchmark for deep multi-agent reinforcement learning. We evaluate QMIX on a challenging set of SMAC scenarios and show that it significantly outperforms existing multi-agent reinforcement learning methods.
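
To make the monotonicity constraint concrete: QMIX requires ∂Q_tot/∂Q_a ≥ 0 for every agent a, and it enforces this by constraining the weights of the mixing network, which are generated by hypernetworks conditioned on the global state, to be non-negative. Below is a minimal PyTorch sketch of such a mixing network; the embedding size, the ELU nonlinearity, and names such as MixingNetwork and hyper_w1 are illustrative assumptions, not the authors' reference implementation.

```python
# A minimal sketch of a QMIX-style mixing network in PyTorch (assumptions:
# layer sizes, ELU nonlinearity, and all names are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixingNetwork(nn.Module):
    """Mixes per-agent Q-values into Q_tot, monotonic in each input.

    Hypernetworks conditioned on the global state produce the mixing
    weights; taking absolute values keeps the weights non-negative,
    which enforces dQ_tot/dQ_a >= 0 for every agent a.
    """

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # State-conditioned hypernetworks for the two mixing layers.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
        )

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        bs = agent_qs.size(0)
        # Absolute value keeps mixing weights non-negative (monotonicity).
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2  # (batch, 1, 1)
        return q_tot.view(bs, 1)


# Usage sketch: 3 agents, a 48-dimensional global state, batch of 8.
mixer = MixingNetwork(n_agents=3, state_dim=48)
q_tot = mixer(torch.randn(8, 3), torch.randn(8, 48))
print(q_tot.shape)  # torch.Size([8, 1])
```

Because the mixing weights are non-negative, each agent greedily maximising its own Q-value yields the same joint action as maximising Q_tot, which is what allows the centrally trained policies to be executed in a fully decentralised fashion.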
Updated: 2020-09-22