QR-MIX: Distributional Value Function Factorisation for Cooperative Multi-Agent Reinforcement Learning
arXiv - CS - Multiagent Systems. Pub Date: 2020-09-09. DOI: arxiv-2009.04197. Authors: Jian Hu, Seth Austin Harding, Haibin Wu, Siyue Hu, Shih-wei Liao
In cooperative Multi-Agent Reinforcement Learning (MARL) under the setting of Centralized Training with Decentralized Execution (CTDE), agents observe and interact with their environment locally and independently. Because of local observation and random sampling, randomness in rewards and observations induces randomness in long-term returns. Existing methods such as the Value Decomposition Network (VDN) and QMIX estimate long-term returns as a scalar that carries no information about this randomness. Our proposed model, QR-MIX, introduces quantile regression, modeling the joint state-action value as a distribution by combining QMIX with the Implicit Quantile Network (IQN). However, the monotonicity constraint in QMIX limits the expressiveness of the joint state-action value distribution and may lead to incorrect estimates in non-monotonic cases. We therefore propose a flexible loss function that approximates the monotonicity constraint found in QMIX. Our model is more tolerant not only of the randomness of returns but also of randomness in the monotonicity constraint. Experimental results demonstrate that QR-MIX outperforms the previous state-of-the-art method QMIX in the StarCraft Multi-Agent Challenge (SMAC) environment.
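
To make the distributional idea concrete: IQN-style critics are trained with the quantile Huber loss, which regresses a set of predicted return quantiles toward Bellman targets with an asymmetric weighting. The sketch below (PyTorch) is a minimal illustration of that standard loss, not code from the QR-MIX paper; all names, shapes, and the kappa default are assumptions.

```python
# Minimal sketch of the quantile Huber loss used by IQN/QR-DQN-style
# distributional critics. Illustrative only; not the QR-MIX implementation.
import torch

def quantile_huber_loss(pred_quantiles, target_quantiles, taus, kappa=1.0):
    """pred_quantiles:   (batch, n_pred)   quantile estimates of Z(s, a)
    target_quantiles: (batch, n_target) r + gamma * quantiles of Z(s', a')
    taus:             (batch, n_pred)   quantile fractions in (0, 1)
    """
    # Pairwise TD errors between every target and predicted quantile:
    # shape (batch, n_target, n_pred).
    td = target_quantiles.unsqueeze(2) - pred_quantiles.unsqueeze(1)

    # Huber loss keeps gradients stable near zero error.
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))

    # Asymmetric quantile weighting: |tau - 1{td < 0}| pushes the tau-th
    # estimate toward the tau-th quantile of the target distribution.
    weight = (taus.unsqueeze(1) - (td.detach() < 0).float()).abs()

    # Sum over predicted quantiles, average over target samples and batch.
    return (weight * huber / kappa).sum(dim=2).mean(dim=1).mean()
```

Minimizing this loss over many sampled taus recovers the full return distribution rather than only its mean, which is the extra information the abstract says scalar estimators like VDN and QMIX discard.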
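The monotonicity constraint discussed above is enforced in QMIX by taking the absolute value of hypernetwork-generated mixing weights, which guarantees dQ_tot/dQ_i >= 0 by construction. The abstract does not specify QR-MIX's flexible loss, so the soft penalty in the sketch below is only one plausible relaxation, included as a labeled assumption rather than the paper's method.

```python
# Sketch of QMIX-style hard monotonic mixing versus one *plausible* soft
# relaxation. The penalty branch is an assumption, not QR-MIX's actual loss.
import torch
import torch.nn as nn

class Mixer(nn.Module):
    def __init__(self, n_agents, state_dim, embed_dim=32, hard=True):
        super().__init__()
        self.hard = hard  # True: QMIX abs(); False: soft penalty (assumed)
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = self.hyper_w1(state).view(-1, self.n_agents, self.embed_dim)
        w2 = self.hyper_w2(state).view(-1, self.embed_dim, 1)
        penalty = agent_qs.new_zeros(())
        if self.hard:
            # QMIX: non-negative weights make Q_tot monotonic in each Q_i.
            w1, w2 = w1.abs(), w2.abs()
        else:
            # Soft alternative (assumption): allow negative weights but
            # penalize them, trading strict monotonicity for expressiveness.
            penalty = torch.relu(-w1).mean() + torch.relu(-w2).mean()
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1)
                            + self.hyper_b1(state).unsqueeze(1))
        q_tot = torch.bmm(hidden, w2).squeeze(-1) + self.hyper_b2(state)
        return q_tot, penalty
```

Under this sketch, training would minimize the quantile loss plus a weighted penalty term; the actual QR-MIX objective may differ in form and weighting.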
Updated: 2020-10-16
在协同多智能体强化学习 (MARL) 和分散执行集中训练 (CTDE) 的设置下,智能体在本地独立地观察环境并与之交互。通过局部观察和随机抽样,奖励和观察的随机性导致长期回报的随机性。现有的价值分解网络(VDN)和QMIX等方法将长期收益的价值估计为一个不包含随机性信息的标量。我们提出的模型 QR-MIX 引入了分位数回归,将联合状态-动作值建模为分布,将 QMIX 与隐式分位数网络 (IQN) 相结合。然而,QMIX 中的单调性限制了联合状态-动作值分布的表达,并可能导致非单调情况下的估计结果不正确。因此,我们提出了一个灵活的损失函数来近似 QMIX 中发现的单调性。我们的模型不仅更能容忍收益的随机性,而且更能容忍单调约束的随机性。实验结果表明,QR-MIX 在星际争霸多代理挑战 (SMAC) 环境中优于之前最先进的方法 QMIX。