MACRPO: Multi-Agent Cooperative Recurrent Policy Optimization
arXiv - CS - Computer Science and Game Theory. Pub Date: 2021-09-02, DOI: arxiv-2109.00882
Eshagh Kargar, Ville Kyrki

This work considers the problem of learning cooperative policies in multi-agent settings with partially observable and non-stationary environments and no communication channel. We focus on improving information sharing between agents and propose a new multi-agent actor-critic method called \textit{Multi-Agent Cooperative Recurrent Proximal Policy Optimization} (MACRPO). We propose two novel ways of integrating information across agents and time in MACRPO. First, we use a recurrent layer in the critic's network architecture and propose a new framework that trains the recurrent layer on a meta-trajectory. This allows the network to learn the cooperation and dynamics of interactions between agents and to handle partial observability. Second, we propose a new advantage function that incorporates other agents' rewards and value functions. We evaluate our algorithm on three challenging multi-agent environments with continuous and discrete action spaces: Deepdrive-Zero, Multi-Walker, and the Particle environment. We compare the results with several ablations and with state-of-the-art multi-agent algorithms such as QMIX and MADDPG, as well as single-agent methods with parameters shared between agents, such as IMPALA and APEX. The results show superior performance compared with the other algorithms. The code is available online at https://github.com/kargarisaac/macrpo.
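To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (a) a recurrent critic that reads a meta-trajectory built by interleaving the agents' observations along the time axis, and (b) an advantage that blends each agent's TD error with the other agents' TD errors. All names (RecurrentCritic, build_meta_trajectory, mixed_advantage) and the beta-weighted mixing scheme are assumptions for illustration, not the paper's exact formulation; see the linked repository for the authors' implementation.

```python
import torch
import torch.nn as nn


class RecurrentCritic(nn.Module):
    """LSTM critic over a meta-trajectory (hypothetical layout:
    agents' observations interleaved along the time axis)."""

    def __init__(self, obs_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, meta_traj):
        # meta_traj: (batch, n_agents * T, obs_dim)
        out, _ = self.lstm(meta_traj)
        return self.value_head(out).squeeze(-1)  # one value per meta-step


def build_meta_trajectory(obs_per_agent):
    """obs_per_agent: (n_agents, T, obs_dim).
    Interleave agents at each time step into one long sequence."""
    n_agents, T, obs_dim = obs_per_agent.shape
    return obs_per_agent.permute(1, 0, 2).reshape(1, n_agents * T, obs_dim)


def mixed_advantage(rewards, values, next_values, gamma=0.99, beta=0.5):
    """Illustrative advantage mixing own and other agents' signals:
    each agent's one-step TD error is blended with the mean TD error
    across agents (the weighting here is an assumption)."""
    td = rewards + gamma * next_values - values          # (n_agents, T)
    return (1 - beta) * td + beta * td.mean(dim=0, keepdim=True)


if __name__ == "__main__":
    n_agents, T, obs_dim = 2, 5, 8
    obs = torch.randn(n_agents, T, obs_dim)
    critic = RecurrentCritic(obs_dim)
    values = critic(build_meta_trajectory(obs))
    print(values.shape)  # torch.Size([1, 10])
```

In this sketch, processing all agents' observations in a single recurrent sequence lets the critic condition its value estimates on the joint interaction history, which is one way to cope with partial observability and non-stationarity without an explicit communication channel.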

Updated: 2021-09-03