QR-MIX: Distributional Value Function Factorisation for Cooperative Multi-Agent Reinforcement Learning
arXiv - CS - Multiagent Systems. Pub Date: 2020-09-09. DOI: arxiv-2009.04197. Authors: Jian Hu, Seth Austin Harding, Haibin Wu, Siyue Hu, Shih-wei Liao
In cooperative Multi-Agent Reinforcement Learning (MARL) under the setting of Centralized Training with Decentralized Execution (CTDE), agents observe and interact with their environment locally and independently. Because of local observation and random sampling, randomness in rewards and observations induces randomness in long-term returns. Existing methods such as the Value Decomposition Network (VDN) and QMIX estimate long-term returns as a scalar that carries no information about this randomness. Our proposed model, QR-MIX, introduces quantile regression, modeling the joint state-action value as a distribution by combining QMIX with the Implicit Quantile Network (IQN). However, the monotonicity constraint in QMIX limits the expressiveness of the joint state-action value distribution and may lead to incorrect estimates in non-monotonic cases. We therefore propose a flexible loss function that approximates the monotonicity constraint found in QMIX. Our model is more tolerant not only of the randomness of returns but also of randomness in the monotonicity constraint. Experimental results demonstrate that QR-MIX outperforms the previous state-of-the-art method QMIX in the StarCraft Multi-Agent Challenge (SMAC) environment.
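
To make the distributional idea concrete: IQN-style critics are trained with the quantile Huber loss, which regresses a set of predicted return quantiles toward Bellman targets with an asymmetric weighting. The sketch below (PyTorch) is a minimal illustration of that standard loss, not code from the QR-MIX paper; all names, shapes, and the kappa default are assumptions.

```python
# Minimal sketch of the quantile Huber loss used by IQN/QR-DQN-style
# distributional critics. Illustrative only; not the QR-MIX implementation.
import torch

def quantile_huber_loss(pred_quantiles, target_quantiles, taus, kappa=1.0):
    """pred_quantiles:   (batch, n_pred)   quantile estimates of Z(s, a)
    target_quantiles: (batch, n_target) r + gamma * quantiles of Z(s', a')
    taus:             (batch, n_pred)   quantile fractions in (0, 1)
    """
    # Pairwise TD errors between every target and predicted quantile:
    # shape (batch, n_target, n_pred).
    td = target_quantiles.unsqueeze(2) - pred_quantiles.unsqueeze(1)

    # Huber loss keeps gradients stable near zero error.
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))

    # Asymmetric quantile weighting: |tau - 1{td < 0}| pushes the tau-th
    # estimate toward the tau-th quantile of the target distribution.
    weight = (taus.unsqueeze(1) - (td.detach() < 0).float()).abs()

    # Sum over predicted quantiles, average over target samples and batch.
    return (weight * huber / kappa).sum(dim=2).mean(dim=1).mean()
```

Minimizing this loss over many sampled taus recovers the full return distribution rather than only its mean, which is the extra information the abstract says scalar estimators like VDN and QMIX discard.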
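The monotonicity constraint discussed above is enforced in QMIX by taking the absolute value of hypernetwork-generated mixing weights, which guarantees dQ_tot/dQ_i >= 0 by construction. The abstract does not specify QR-MIX's flexible loss, so the soft penalty in the sketch below is only one plausible relaxation, included as a labeled assumption rather than the paper's method.

```python
# Sketch of QMIX-style hard monotonic mixing versus one *plausible* soft
# relaxation. The penalty branch is an assumption, not QR-MIX's actual loss.
import torch
import torch.nn as nn

class Mixer(nn.Module):
    def __init__(self, n_agents, state_dim, embed_dim=32, hard=True):
        super().__init__()
        self.hard = hard  # True: QMIX abs(); False: soft penalty (assumed)
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = self.hyper_w1(state).view(-1, self.n_agents, self.embed_dim)
        w2 = self.hyper_w2(state).view(-1, self.embed_dim, 1)
        penalty = agent_qs.new_zeros(())
        if self.hard:
            # QMIX: non-negative weights make Q_tot monotonic in each Q_i.
            w1, w2 = w1.abs(), w2.abs()
        else:
            # Soft alternative (assumption): allow negative weights but
            # penalize them, trading strict monotonicity for expressiveness.
            penalty = torch.relu(-w1).mean() + torch.relu(-w2).mean()
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1)
                            + self.hyper_b1(state).unsqueeze(1))
        q_tot = torch.bmm(hidden, w2).squeeze(-1) + self.hyper_b2(state)
        return q_tot, penalty
```

Under this sketch, training would minimize the quantile loss plus a weighted penalty term; the actual QR-MIX objective may differ in form and weighting.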
Updated: 2020-10-16
在协同多智能体强化学习 (MARL) 和分散执行集中训练 (CTDE) 的设置下,智能体在本地独立地观察环境并与之交互。通过局部观察和随机抽样,奖励和观察的随机性导致长期回报的随机性。现有的价值分解网络(VDN)和QMIX等方法将长期收益的价值估计为一个不包含随机性信息的标量。我们提出的模型 QR-MIX 引入了分位数回归,将联合状态-动作值建模为分布,将 QMIX 与隐式分位数网络 (IQN) 相结合。然而,QMIX 中的单调性限制了联合状态-动作值分布的表达,并可能导致非单调情况下的估计结果不正确。因此,我们提出了一个灵活的损失函数来近似 QMIX 中发现的单调性。我们的模型不仅更能容忍收益的随机性,而且更能容忍单调约束的随机性。实验结果表明,QR-MIX 在星际争霸多代理挑战 (SMAC) 环境中优于之前最先进的方法 QMIX。