Open-Ended Learning Leads to Generally Capable Agents
arXiv - CS - Multiagent Systems. Pub Date: 2021-07-27. DOI: arXiv-2107.12808
Open-Ended Learning Team, Adam Stooke, Anuj Mahajan, Catarina Barros, Charlie Deck, Jakob Bauer, Jakub Sygnowski, Maja Trebacz, Max Jaderberg, Michael Mathieu, Nat McAleese, Nathalie Bradley-Schmieg, Nathaniel Wong, Nicolas Porcel, Roberta Raileanu, Steph Hughes-Fitt, Valentin Dalibard, Wojciech Marian Czarnecki

In this work we create agents that perform well beyond a single, individual task, exhibiting much wider generalisation of behaviour across a massive, rich space of challenges. We define a universe of tasks within an environment domain and demonstrate the ability to train agents that are generally capable across this vast space and beyond. The environment is natively multi-agent, spanning the continuum of competitive, cooperative, and independent games, which are situated within procedurally generated physical 3D worlds. The resulting space is exceptionally diverse in terms of the challenges posed to agents, and as such, even measuring the learning progress of an agent is an open research problem. Rather than seeking to maximise a singular objective, we propose an iterative notion of improvement between successive generations of agents, allowing us to quantify progress despite tasks being incomparable in terms of achievable rewards. We show that by constructing an open-ended learning process, which dynamically changes the training task distributions and training objectives so that the agent never stops learning, we achieve consistent learning of new behaviours. The resulting agent is able to obtain reward in every one of our humanly solvable evaluation levels, with behaviour generalising to many held-out points in the universe of tasks. Examples of this zero-shot generalisation include good performance on Hide and Seek, Capture the Flag, and Tag. Through analysis and hand-authored probe tasks we characterise the behaviour of our agent, and find interesting emergent heuristic behaviours such as trial-and-error experimentation, simple tool use, option switching, and cooperation. Finally, we demonstrate that the general capabilities of this agent could unlock larger-scale transfer of behaviour through cheap finetuning.
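The generational-improvement idea above can be illustrated with a toy sketch. All names here are hypothetical and the criterion is deliberately simplified: the assumption is that each task's raw reward is first normalised against a per-task baseline (so scores on otherwise incomparable tasks can be compared), and a new generation counts as an improvement if it Pareto-dominates its predecessor on the normalised scores. The paper's actual evaluation is more involved (e.g. percentile-based aggregation across the task universe).

```python
def normalized_scores(raw, baseline):
    """Normalise each task's raw reward against a per-task baseline score,
    so that progress on tasks with very different reward scales can be
    compared on a common footing. Both arguments map task name -> score."""
    return {task: raw[task] / baseline[task] for task in raw}

def improves_on(new_gen, old_gen, tasks):
    """Toy improvement criterion between successive generations:
    the new generation must be at least as good on every task and
    strictly better on at least one (Pareto dominance)."""
    at_least_as_good = all(new_gen[t] >= old_gen[t] for t in tasks)
    strictly_better = any(new_gen[t] > old_gen[t] for t in tasks)
    return at_least_as_good and strictly_better

# Example with already-normalised per-task scores (hypothetical numbers).
old = {"hide_and_seek": 0.4, "capture_the_flag": 0.2, "tag": 0.5}
new = {"hide_and_seek": 0.6, "capture_the_flag": 0.2, "tag": 0.7}
print(improves_on(new, old, old))  # True: better on two tasks, worse on none
```

Note that a single scalar objective (e.g. mean reward) would hide regressions on individual tasks; the dominance check is one simple way to make "never stops learning" precise across an incomparable task set.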

Updated: 2021-07-28