Corrigibility with Utility Preservation
arXiv - CS - Artificial Intelligence, Pub Date: 2019-08-05, DOI: arxiv-1908.01695
Koen Holtman

Corrigibility is a safety property for artificially intelligent agents. A corrigible agent will not resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started. This paper shows how to construct a safety layer that adds corrigibility to arbitrarily advanced utility-maximizing agents, including possible future agents with Artificial General Intelligence (AGI). The layer counteracts the emergent incentive of advanced agents to resist such alteration. A detailed model for agents that can reason about preserving their utility function is developed and used to prove that the corrigibility layer works as intended in a large set of non-hostile universes. The corrigible agents have an emergent incentive to protect key elements of their corrigibility layer. However, hostile universes may contain forces strong enough to break safety features. Some open problems related to graceful degradation when an agent is successfully attacked are identified. The results in this paper were obtained by concurrently developing an AGI agent simulator, an agent model, and proofs. The simulator is available under an open source license. The paper contains simulation results that illustrate the safety-related properties of corrigible AGI agents in detail.

Updated: 2020-04-06