A Multi-tier Reinforcement Learning Model for a Cooperative Multi-agent System
IEEE Transactions on Cognitive and Developmental Systems (IF 5), Pub Date: 2020-09-01, DOI: 10.1109/tcds.2020.2970487
Haobin Shi, Liangjing Zhai, Haibo Wu, Maxwell Hwang, Kao-Shing Hwang, Hsuan-Pei Hsu

In multi-agent cooperative systems based on value-based reinforcement learning, agents learn how to complete a task through an optimal policy obtained by iterative value-policy improvement. However, designing a policy that avoids cooperation dilemmas and reaches consensus among agents remains an important issue. This article proposes a method that improves the coordination ability of agents in cooperative systems by assessing their cooperative tendency and increases the collective payoff through a candidate policy. The method learns cooperative rules by recording the agents' cooperation probabilities in a multi-tier reinforcement learning model. Candidate action sets are selected through the candidate policy, which considers the payoff of the coalition. The optimal strategy is then selected from these candidate action sets through the Nash bargaining solution (NBS). The method is tested on two cooperative tasks. The results show that the proposed algorithm, which addresses the instability and ambiguity of win-or-learn-fast policy hill-climbing (WoLF-PHC) and requires significantly less memory than the NBS, is more stable and more efficient than other methods.
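To make the final selection step concrete, the sketch below illustrates, under simplifying assumptions, how a joint action might be chosen from per-agent candidate action sets by maximizing the Nash bargaining product. This is not the authors' implementation: the candidate sets, payoff table, and disagreement point used here are hypothetical placeholders standing in for the quantities the paper derives from its multi-tier learning model.

```python
# Illustrative sketch only: NBS-style selection over candidate joint actions.
# The payoff function and disagreement point are hypothetical, not from the paper.
from itertools import product

def nash_bargaining_select(candidates, payoff, disagreement):
    """candidates: list of per-agent candidate action lists.
    payoff(joint): per-agent payoffs (tuple) for a joint action.
    disagreement: per-agent fallback payoffs (the disagreement point)."""
    best_joint, best_score = None, float("-inf")
    for joint in product(*candidates):
        gains = [u - d for u, d in zip(payoff(joint), disagreement)]
        if any(g < 0 for g in gains):   # skip joint actions below the disagreement point
            continue
        score = 1.0
        for g in gains:
            score *= g                   # Nash product of individual gains
        if score > best_score:
            best_joint, best_score = joint, score
    return best_joint

# Toy usage with two agents and a hypothetical payoff table
if __name__ == "__main__":
    candidates = [["a1", "a2"], ["b1", "b2"]]
    table = {("a1", "b1"): (3.0, 2.0), ("a1", "b2"): (1.0, 1.0),
             ("a2", "b1"): (2.0, 2.5), ("a2", "b2"): (0.5, 0.5)}
    print(nash_bargaining_select(candidates, lambda j: table[j], (0.5, 0.5)))
    # -> ("a1", "b1"), the joint action with the largest Nash product
```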
