School of hard knocks: Curriculum analysis for Pommerman with a fixed computational budget
arXiv - CS - Artificial Intelligence Pub Date : 2021-02-23 , DOI: arxiv-2102.11762
Omkar Shelke, Hardik Meisheri, Harshad Khadilkar

Pommerman is a hybrid cooperative/adversarial multi-agent environment, with challenging characteristics in terms of partial observability, limited or no communication, sparse and delayed rewards, and restrictive computational time limits. This makes it a difficult environment for reinforcement learning (RL) approaches. In this paper, we focus on developing a curriculum for learning a robust and promising policy within a constrained computational budget of 100,000 games, starting from a fixed base policy (which is itself trained to imitate a noisy expert policy). All RL algorithms starting from the base policy use vanilla proximal policy optimization (PPO) with the same reward function, and the only difference between their training runs is the mix and sequence of opponent policies. One might expect that beginning training with simpler opponents and gradually increasing the opponent difficulty would facilitate faster learning and lead to more robust policies than a baseline in which all available opponent policies are introduced from the start. We test this hypothesis and show that, within constrained computational budgets, it is in fact better to "learn in the school of hard knocks", i.e., to train against all available opponent policies nearly from the start. We also include ablation studies on the effect of modifying the base environment properties of ammo and bomb blast strength on agent performance.
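To make the two training regimes concrete, the sketch below contrasts an easy-to-hard opponent curriculum with the "school of hard knocks" schedule (full opponent pool from the start) under a fixed 100,000-game budget. This is a minimal illustration under assumptions: the opponent names and the helper play_and_update are hypothetical stand-ins, not code or agent names from the paper or the Pommerman package.

```python
# Minimal sketch of the opponent-scheduling comparison described in the abstract.
# All identifiers here (OPPONENTS, play_and_update) are illustrative assumptions.

import random

TOTAL_GAMES = 100_000  # fixed computational budget from the paper
OPPONENTS = ["static", "random", "simple", "strong"]  # hypothetical pool, easiest to hardest

def play_and_update(policy, opponent):
    """Placeholder for one game against `opponent` followed by a vanilla PPO update
    with the shared reward function; a real setup would roll out a Pommerman match here."""
    pass

def curriculum_schedule(game_idx):
    """Easy-to-hard curriculum: unlock harder opponents as training progresses."""
    stage = min(game_idx * len(OPPONENTS) // TOTAL_GAMES, len(OPPONENTS) - 1)
    return random.choice(OPPONENTS[: stage + 1])

def hard_knocks_schedule(game_idx):
    """'School of hard knocks': sample from the full opponent pool from the start."""
    return random.choice(OPPONENTS)

def train(base_policy, schedule):
    """Train from the fixed base policy; only the opponent schedule differs between runs."""
    policy = base_policy
    for game_idx in range(TOTAL_GAMES):
        play_and_update(policy, schedule(game_idx))
    return policy
```

Both runs start from the same base policy and use the same budget; the paper's finding is that the hard_knocks_schedule variant yields the more robust policy within this budget.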

Last updated: 2021-02-24