Non-Cooperative Inverse Reinforcement Learning,arXiv - CS - Computer Science and Game Theory


Non-Cooperative Inverse Reinforcement Learning
arXiv - CS - Computer Science and Game Theory. Pub Date: 2019-11-03, DOI: arxiv-1911.04220
Xiangyuan Zhang, Kaiqing Zhang, Erik Miehling, Tamer Başar

Making decisions in the presence of a strategic opponent requires one to take into account the opponent's ability to actively mask its intended objective. To describe such strategic situations, we introduce the non-cooperative inverse reinforcement learning (N-CIRL) formalism. The N-CIRL formalism consists of two agents with completely misaligned objectives, where only one of the agents knows the true objective function. Formally, we model the N-CIRL formalism as a zero-sum Markov game with one-sided incomplete information. Through interacting with the more informed player, the less informed player attempts to both infer, and act according to, the true objective function. As a result of the one-sided incomplete information, the multi-stage game can be decomposed into a sequence of single-stage games expressed by a recursive formula. Solving this recursive formula yields the value of the N-CIRL game and the more informed player's equilibrium strategy. Another recursive formula, constructed by forming an auxiliary game, termed the dual game, yields the less informed player's strategy. Building upon these two recursive formulas, we develop a computationally tractable algorithm to approximately solve for the equilibrium strategies. Finally, we demonstrate the benefits of our N-CIRL formalism over the existing multi-agent IRL formalism via extensive numerical simulation in a novel cyber security setting.
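
The inference step described above, in which the less informed player tries to infer the true objective from the other player's behavior, can be pictured as a Bayesian belief update. The following is only a minimal sketch under stated assumptions and is not the algorithm from the paper: the function name `update_belief`, the two candidate objectives, the action names, and all probabilities are hypothetical, and the sketch assumes the less informed player knows the informed player's objective-dependent action distribution.

```python
from typing import Dict


def update_belief(
    prior: Dict[str, float],
    strategy: Dict[str, Dict[str, float]],
    observed_action: str,
) -> Dict[str, float]:
    """Bayes update of a belief over candidate objectives (hypothetical helper).

    prior[theta]          : current probability that theta is the true objective
    strategy[theta][a]    : assumed probability that the informed player
                            takes action a when the true objective is theta
    """
    # Weight each hypothesis by how likely it makes the observed action.
    unnormalized = {
        theta: p * strategy[theta].get(observed_action, 0.0)
        for theta, p in prior.items()
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        # The action is impossible under every hypothesis; keep the prior.
        return dict(prior)
    return {theta: w / total for theta, w in unnormalized.items()}


# Illustrative numbers only (assumptions, not taken from the paper).
prior = {"target_A": 0.5, "target_B": 0.5}
strategy = {
    "target_A": {"probe": 0.8, "wait": 0.2},
    "target_B": {"probe": 0.2, "wait": 0.8},
}
posterior = update_belief(prior, strategy, "probe")
```

In this toy setting, observing `probe` shifts the belief toward `target_A`. A strategic informed player would anticipate this and choose its action distribution to limit what the observation reveals, which is the masking behavior the abstract refers to.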


Updated: 2020-01-07
