Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games
arXiv - CS - Computer Science and Game Theory. Pub Date: 2021-06-09. DOI: arxiv-2106.04958. Authors: Xiangyu Liu, Hangtian Jia, Ying Wen, Yaodong Yang, Yujing Hu, Yingfeng Chen, Changjie Fan, Zhipeng Hu
Measuring and promoting policy diversity is critical for solving games with
strong non-transitive dynamics where strategic cycles exist, and there is no
consistent winner (e.g., Rock-Paper-Scissors). With that in mind, maintaining a
pool of diverse policies via open-ended learning is an attractive solution,
which can generate auto-curricula to avoid being exploited. However, in
conventional open-ended learning algorithms, there is no widely accepted
definition of diversity, making it hard to construct and evaluate diverse
policies. In this work, we summarize previous concepts of diversity and work
towards a unified measure of diversity for multi-agent open-ended learning
that covers all elements of Markov games, based on both Behavioral
Diversity (BD) and Response Diversity (RD). At the trajectory distribution
level, we re-define BD in the state-action space as the discrepancies of
occupancy measures. For the reward dynamics, we propose RD to characterize
diversity through the responses of policies when encountering different
opponents. We also show that many current diversity measures fall in one of the
categories of BD or RD but not both. With this unified diversity measure, we
design the corresponding diversity-promoting objective and population
effectivity when seeking the best responses in open-ended learning. We validate
our methods in relatively simple settings, such as matrix games and a
non-transitive mixture model, as well as in the complex \textit{Google
Research Football} environment.
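The two diversity notions above can be sketched concretely. The following is an illustrative simplification, not the paper's exact estimators: BD is taken as a total-variation style discrepancy between discounted state-action occupancy measures in a small Markov game, and RD as the minimum distance between a candidate's payoff vector (against a fixed opponent set) and those of the existing population. All function names and the toy game below are our own constructions.

```python
import numpy as np

# Simplified sketch of the two diversity notions:
# - Behavioral Diversity (BD): discrepancy between discounted state-action
#   occupancy measures of two policies.
# - Response Diversity (RD): distance between payoff vectors against a fixed
#   set of opponents.

def occupancy_measure(policy, transitions, start, gamma=0.9, horizon=100):
    """Discounted state-action occupancy d(s, a) of `policy`.
    policy: (S, A) with pi(a|s); transitions: (S, A, S) with P(s'|s, a);
    start: (S,) initial state distribution."""
    d = np.zeros_like(policy)
    state_dist = start.copy()
    for t in range(horizon):
        sa = state_dist[:, None] * policy           # joint Pr(s, a) at step t
        d += (gamma ** t) * sa
        state_dist = np.einsum("sa,sap->p", sa, transitions)
    return (1 - gamma) * d                          # normalized distribution

def behavioral_diversity(d1, d2):
    """Total-variation style discrepancy between occupancy measures."""
    return 0.5 * np.abs(d1 - d2).sum()

def response_diversity(pool_payoffs, new_payoffs):
    """Minimum L2 distance from a candidate's payoff vector (vs. a fixed
    opponent set) to the payoff vectors of existing population members."""
    return np.linalg.norm(pool_payoffs - new_payoffs, axis=1).min()

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))          # random 3-state, 2-action kernel
mu0 = np.array([1.0, 0.0, 0.0])
pi1 = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
pi2 = np.array([[0.1, 0.9], [0.5, 0.5], [0.8, 0.2]])
bd = behavioral_diversity(occupancy_measure(pi1, P, mu0),
                          occupancy_measure(pi2, P, mu0))

# Rock-Paper-Scissors style payoff rows against three fixed opponents
pool = np.array([[0.0, 1.0, -1.0],                  # member that plays Rock
                 [-1.0, 0.0, 1.0]])                 # member that plays Paper
rd_dup = response_diversity(pool, np.array([0.0, 1.0, -1.0]))  # duplicate row
rd_new = response_diversity(pool, np.array([1.0, -1.0, 0.0]))  # plays Scissors
print(f"BD = {bd:.3f}, RD(duplicate) = {rd_dup:.3f}, RD(novel) = {rd_new:.3f}")
```

Note how a policy that duplicates an existing member's responses gets zero RD even if its internal behavior differs, which is exactly why the paper argues BD and RD capture complementary aspects of diversity.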
The population found by our methods achieves the lowest exploitability and the
highest population effectivity in the matrix game and the non-transitive
mixture model, as well as the largest goal difference when playing against
opponents of various levels in \textit{Google Research Football}.
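The exploitability metric reported above can be illustrated with a minimal sketch on Rock-Paper-Scissors (our own example, not the paper's evaluation code): it is the payoff a best-responding opponent secures against a given mixed strategy, and it vanishes at the Nash mixture.

```python
import numpy as np

# Exploitability of a row mixture sigma in a zero-sum matrix game: the gain a
# best-responding column player obtains against it. The uniform mixture is the
# Nash equilibrium of Rock-Paper-Scissors, so its exploitability is zero.

A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]], dtype=float)   # row player's payoff matrix

def exploitability(sigma, payoff=A):
    """Opponent's best-response gain against the row mixture sigma."""
    return float(-(sigma @ payoff).min())

uniform = np.ones(3) / 3                  # Nash mixture of RPS
pure_rock = np.array([1.0, 0.0, 0.0])     # fully exploitable pure strategy
print(exploitability(uniform), exploitability(pure_rock))
```

A pure Rock strategy has exploitability 1 (Paper wins every round against it), while the uniform mixture cannot be exploited at all, matching the sense in which a low-exploitability population is hard to beat.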
Updated: 2021-06-10