Modified action decoder using Bayesian reasoning for multi-agent deep reinforcement learning, International Journal of Machine Learning and Cybernetics


International Journal of Machine Learning and Cybernetics (IF 3.1), Pub Date: 2021-07-27, DOI: 10.1007/s13042-021-01385-7
Wei Du₁, Shifei Ding₁,₂, Chenglong Zhang₁, Shuying Du₁


Deep reinforcement learning has achieved superhuman performance in zero-sum games such as Go and poker in recent years. Many real-world scenarios, however, are non-zero-sum settings in which success requires cooperation and communication rather than competition. The game Hanabi has been established as an ideal benchmark for agents learning to cooperate with other agents and with humans. Bayesian action decoder methods perform well in two-player Hanabi, but in the 3–5 player settings a large gap remains between the scores these methods achieve and those of hat-coding strategies. The pivotal problem is the conflict between the exploration of actions and the exploitation of observed actions. We present a novel deep multi-agent reinforcement learning method, the Modified Action Decoder, which resolves this problem using the centralized training with decentralized execution paradigm. During training, agents observe not only the exploratory action their teammates select but also the teammates' optimal action, for better exploitation. We evaluate the method on Hanabi in the 2–5 player settings, where it outperforms previously published reinforcement learning methods and establishes a new state of the art.
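The training-time idea in the abstract (teammates see both the exploratory action and the optimal action) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ActionSignal`, `select_action`, and `teammate_input` are hypothetical, epsilon-greedy exploration over a list of Q-values is an assumed stand-in for the paper's exploration scheme, and zero-filling the greedy slot at execution time is an assumption of this sketch.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionSignal:
    """Hypothetical container for what a teammate gets to see about one move."""
    executed: int           # action actually played (may be exploratory)
    greedy: Optional[int]   # greedy (optimal) action, shared only during training


def select_action(q_values: List[float], epsilon: float,
                  rng: random.Random, training: bool) -> ActionSignal:
    """Epsilon-greedy selection that also records the greedy action.

    Assumption: exploration is epsilon-greedy; the abstract does not say so.
    """
    greedy = max(range(len(q_values)), key=lambda a: q_values[a])
    if training and rng.random() < epsilon:
        executed = rng.randrange(len(q_values))  # exploratory move
    else:
        executed = greedy
    # The greedy action is exposed only under centralized training.
    return ActionSignal(executed=executed, greedy=greedy if training else None)


def teammate_input(signal: ActionSignal, num_actions: int) -> List[float]:
    """Encode the signal as one-hot(executed) followed by one-hot(greedy).

    Assumption: the greedy slot is all zeros when no greedy action is shared.
    """
    executed = [0.0] * num_actions
    executed[signal.executed] = 1.0
    greedy = [0.0] * num_actions
    if signal.greedy is not None:
        greedy[signal.greedy] = 1.0
    return executed + greedy
```

In this sketch a teammate's network would receive `teammate_input(...)` as an extra feature vector, so that during training it can condition on the intended action even when the played action was a random exploratory one.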


Updated: 2021-08-23
