Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games
arXiv - CS - Computer Science and Game Theory. Pub Date: 2021-06-09. DOI: arxiv-2106.04958. Authors: Xiangyu Liu, Hangtian Jia, Ying Wen, Yaodong Yang, Yujing Hu, Yingfeng Chen, Changjie Fan, Zhipeng Hu
Measuring and promoting policy diversity is critical for solving games with
strong non-transitive dynamics where strategic cycles exist, and there is no
consistent winner (e.g., Rock-Paper-Scissors). With that in mind, maintaining a
pool of diverse policies via open-ended learning is an attractive solution,
which can generate auto-curricula to avoid being exploited. However, in
conventional open-ended learning algorithms, there is no widely accepted
definition of diversity, making it hard to construct and evaluate diverse
policies. In this work, we summarize previous concepts of diversity and work
towards a unified measure of diversity for multi-agent open-ended learning
that covers all elements of Markov games, based on both Behavioral
Diversity (BD) and Response Diversity (RD). At the trajectory distribution
level, we re-define BD in the state-action space as the discrepancies of
occupancy measures. For the reward dynamics, we propose RD to characterize
diversity through the responses of policies when encountering different
opponents. We also show that many current diversity measures fall in one of the
categories of BD or RD but not both. With this unified diversity measure, we
design the corresponding diversity-promoting objective and population
effectivity when seeking the best responses in open-ended learning. We validate
our methods in relatively simple settings, such as matrix games and a
non-transitive mixture model, as well as in the complex \textit{Google
Research Football} environment.
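The two diversity notions above can be sketched concretely. The following is an illustrative simplification, not the paper's exact estimators: BD is taken as a total-variation style discrepancy between discounted state-action occupancy measures in a small Markov game, and RD as the minimum distance between a candidate's payoff vector (against a fixed opponent set) and those of the existing population. All function names and the toy game below are our own constructions.

```python
import numpy as np

# Simplified sketch of the two diversity notions:
# - Behavioral Diversity (BD): discrepancy between discounted state-action
#   occupancy measures of two policies.
# - Response Diversity (RD): distance between payoff vectors against a fixed
#   set of opponents.

def occupancy_measure(policy, transitions, start, gamma=0.9, horizon=100):
    """Discounted state-action occupancy d(s, a) of `policy`.
    policy: (S, A) with pi(a|s); transitions: (S, A, S) with P(s'|s, a);
    start: (S,) initial state distribution."""
    d = np.zeros_like(policy)
    state_dist = start.copy()
    for t in range(horizon):
        sa = state_dist[:, None] * policy           # joint Pr(s, a) at step t
        d += (gamma ** t) * sa
        state_dist = np.einsum("sa,sap->p", sa, transitions)
    return (1 - gamma) * d                          # normalized distribution

def behavioral_diversity(d1, d2):
    """Total-variation style discrepancy between occupancy measures."""
    return 0.5 * np.abs(d1 - d2).sum()

def response_diversity(pool_payoffs, new_payoffs):
    """Minimum L2 distance from a candidate's payoff vector (vs. a fixed
    opponent set) to the payoff vectors of existing population members."""
    return np.linalg.norm(pool_payoffs - new_payoffs, axis=1).min()

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))          # random 3-state, 2-action kernel
mu0 = np.array([1.0, 0.0, 0.0])
pi1 = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
pi2 = np.array([[0.1, 0.9], [0.5, 0.5], [0.8, 0.2]])
bd = behavioral_diversity(occupancy_measure(pi1, P, mu0),
                          occupancy_measure(pi2, P, mu0))

# Rock-Paper-Scissors style payoff rows against three fixed opponents
pool = np.array([[0.0, 1.0, -1.0],                  # member that plays Rock
                 [-1.0, 0.0, 1.0]])                 # member that plays Paper
rd_dup = response_diversity(pool, np.array([0.0, 1.0, -1.0]))  # duplicate row
rd_new = response_diversity(pool, np.array([1.0, -1.0, 0.0]))  # plays Scissors
print(f"BD = {bd:.3f}, RD(duplicate) = {rd_dup:.3f}, RD(novel) = {rd_new:.3f}")
```

Note how a policy that duplicates an existing member's responses gets zero RD even if its internal behavior differs, which is exactly why the paper argues BD and RD capture complementary aspects of diversity.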
The population found by our methods achieves the lowest exploitability and the
highest population effectivity in the matrix game and the non-transitive
mixture model, as well as the largest goal difference when playing against
opponents of various levels in \textit{Google Research Football}.
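The exploitability metric reported above can be illustrated with a minimal sketch on Rock-Paper-Scissors (our own example, not the paper's evaluation code): it is the payoff a best-responding opponent secures against a given mixed strategy, and it vanishes at the Nash mixture.

```python
import numpy as np

# Exploitability of a row mixture sigma in a zero-sum matrix game: the gain a
# best-responding column player obtains against it. The uniform mixture is the
# Nash equilibrium of Rock-Paper-Scissors, so its exploitability is zero.

A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]], dtype=float)   # row player's payoff matrix

def exploitability(sigma, payoff=A):
    """Opponent's best-response gain against the row mixture sigma."""
    return float(-(sigma @ payoff).min())

uniform = np.ones(3) / 3                  # Nash mixture of RPS
pure_rock = np.array([1.0, 0.0, 0.0])     # fully exploitable pure strategy
print(exploitability(uniform), exploitability(pure_rock))
```

A pure Rock strategy has exploitability 1 (Paper wins every round against it), while the uniform mixture cannot be exploited at all, matching the sense in which a low-exploitability population is hard to beat.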
Updated: 2021-06-10