Learning adversarial policy in multiple scenes environment via multi-agent reinforcement learning
Connection Science (IF 3.2), Pub Date: 2020-11-03, DOI: 10.1080/09540091.2020.1832961
Yang Li, Xinzhi Wang, Wei Wang, Zhenyu Zhang, Jianshu Wang, Xiangfeng Luo, Shaorong Xie
ABSTRACT

Learning adversarial policies, which aims to learn behavioural strategies for agents with different goals, is one of the most significant tasks in multi-agent systems. Multi-agent reinforcement learning (MARL), a state-of-the-art learning-based approach, employs centralised or decentralised control methods to learn behavioural strategies by interacting with the environment, but it suffers from instability and slow convergence during training. Considering that parallel simulation or computation is an effective way to improve training performance, we propose a novel MARL method in this paper, called Multiple scenes multi-agent proximal Policy Optimisation (MPO). In MPO, we first simulate multiple parallel scenes in the training environment: multiple policies control different agents within the same scene, and each policy also controls several identical agents across the parallel scenes. Then, we extend proximal policy optimisation (PPO) with an improved actor-critic network to ensure training stability in multi-agent tasks; the actor network uses only local information for decision making, while the critic network uses global information for training. Finally, effective training trajectories are computed with two criteria from the multiple parallel scenes rather than a single scene to accelerate the learning process. We evaluate our approach in two simulated 3D environments: Unity's official open-source soccer game and an unmanned surface vehicle (USV) environment built with Unity. Experiments demonstrate that MPO converges more stably and faster than benchmark methods during model training and learns a better adversarial policy than benchmark models.
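
The centralised-critic design sketched in the abstract (an actor that sees only its agent's local observation, a critic trained on global information, and PPO updates over trajectories pooled from several parallel scenes) can be illustrated as follows. This is a minimal, assumption-laden sketch in PyTorch, not the authors' implementation: all module names, network sizes, and the flat-batch format are invented for the example.

import torch
import torch.nn as nn

class LocalActor(nn.Module):
    """Maps one agent's local observation to a categorical action distribution."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, n_actions),
        )

    def forward(self, local_obs):
        return torch.distributions.Categorical(logits=self.net(local_obs))

class GlobalCritic(nn.Module):
    """Estimates the value of the global state (information from all agents/scenes)."""
    def __init__(self, global_state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(global_state_dim, 128), nn.Tanh(),
            nn.Linear(128, 1),
        )

    def forward(self, global_state):
        return self.net(global_state).squeeze(-1)

def ppo_update(actor, critic, optim, batch, clip_eps=0.2, vf_coef=0.5):
    """One PPO step on experience collected from multiple parallel scenes."""
    obs, gstate, actions, old_logp, returns, adv = batch
    dist = actor(obs)
    logp = dist.log_prob(actions)
    ratio = torch.exp(logp - old_logp)                        # pi_new / pi_old
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)  # PPO clipping
    policy_loss = -torch.min(ratio * adv, clipped * adv).mean()
    value_loss = (critic(gstate) - returns).pow(2).mean()     # centralised critic
    loss = policy_loss + vf_coef * value_loss
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

# Toy usage: 32 transitions gathered from several parallel scenes for one agent type.
if __name__ == "__main__":
    n, obs_dim, gstate_dim, n_actions = 32, 10, 40, 5
    actor, critic = LocalActor(obs_dim, n_actions), GlobalCritic(gstate_dim)
    optim = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)
    batch = (
        torch.randn(n, obs_dim),            # local observations
        torch.randn(n, gstate_dim),         # global states
        torch.randint(0, n_actions, (n,)),  # actions taken
        torch.zeros(n),                     # log-probs under the old policy
        torch.randn(n),                     # discounted returns
        torch.randn(n),                     # advantage estimates
    )
    print("loss:", ppo_update(actor, critic, optim, batch))

In this sketch, only the critic consumes the concatenated global state, so it can be discarded after training and each agent acts from its local observation alone; pooling the batch from several scenes simply means the transitions come from independently simulated copies of the environment.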



Last updated: 2020-11-03