Evolutionary Selective Imitation: Interpretable Agents by Imitation Learning Without a Demonstrator
arXiv - CS - Neural and Evolutionary Computing Pub Date : 2020-09-17 , DOI: arxiv-2009.08403
Roy Eliya, J. Michael Herrmann

We propose a new method for training an agent via an evolutionary strategy (ES), in which we iteratively improve a set of samples to imitate: starting with a random set, in every iteration we replace a subset of the samples with samples from the best trajectories discovered so far. The evaluation procedure for this set is to train, via supervised learning, a randomly initialised neural network (NN) to imitate the set and then execute the acquired policy against the environment. Our method is thus an ES based on a fitness function that expresses the effectiveness of imitating an evolving data subset. This is in contrast to other ES techniques that iterate over the weights of the policy directly. By observing the samples that the agent selects for learning, it is possible to interpret and evaluate the evolving strategy of the agent more explicitly than in NN learning. In our experiments, we trained an agent to solve the OpenAI Gym environment BipedalWalker-v3 by imitating an evolutionarily selected set of only 25 samples with a NN with only a few thousand parameters. We further test our method on the Procgen game Plunder and show there as well that the proposed method is an interpretable, small, robust and effective alternative to other ES or policy gradient methods.
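The loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several pieces are assumptions made to keep it self-contained: the toy 1-D environment, a linear least-squares fit standing in for the randomly initialised NN, a simple keep-if-better acceptance rule, and all names (`rollout`, `fit_policy`, `evolve`) and parameter values other than the set size of 25.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(w, steps=30):
    """Run the linear policy a = w[0]*x + w[1] on a toy 1-D task.

    Returns (episode return, list of visited (state, action) pairs).
    """
    x, total, traj = 1.0, 0.0, []
    for _ in range(steps):
        a = float(np.clip(w[0] * x + w[1], -1.0, 1.0))
        traj.append((x, a))
        x = x + 0.2 * a      # toy dynamics (assumption, not from the paper)
        total -= abs(x)      # reward: stay close to the origin
    return total, traj

def fit_policy(samples):
    """Supervised step: least-squares fit of actions to states.

    A stand-in for training a randomly initialised NN on the sample set.
    """
    s = np.array(samples, dtype=float)
    X = np.column_stack([s[:, 0], np.ones(len(s))])
    w, *_ = np.linalg.lstsq(X, s[:, 1], rcond=None)
    return w

def evaluate(samples):
    """Fitness of a sample set: imitate it, then run the resulting policy."""
    return rollout(fit_policy(samples))

def evolve(set_size=25, n_replace=5, iterations=50):
    # Start from a random set of (state, action) pairs.
    current = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(set_size)]
    best_fit, best_traj = evaluate(current)
    initial_fit = best_fit
    for _ in range(iterations):
        # Replace a random subset with samples from the best trajectory so far.
        candidate = list(current)
        slots = rng.choice(set_size, size=n_replace, replace=False)
        picks = rng.choice(len(best_traj), size=n_replace)
        for i, p in zip(slots, picks):
            candidate[i] = best_traj[p]
        fit, traj = evaluate(candidate)
        # Keep the candidate only if it improves the fitness (assumed rule).
        if fit > best_fit:
            current, best_fit, best_traj = candidate, fit, traj
    return initial_fit, best_fit, current
```

Calling `evolve()` returns the fitness of the initial random set, the best fitness found, and the final sample set. The search acts on the 25 `(state, action)` pairs and never on the policy weights directly, so the surviving pairs can be inspected to see what behaviour the agent is imitating.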

Updated: 2020-09-18