Opponent Learning Awareness and Modelling in Multi-Objective Normal Form Games
arXiv - CS - Multiagent Systems Pub Date : 2020-11-14 , DOI: arxiv-2011.07290
Roxana Rădulescu, Timothy Verstraeten, Yijie Zhang, Patrick Mannion, Diederik M. Roijers, Ann Nowé

Many real-world multi-agent interactions involve multiple distinct criteria, i.e., the payoffs are multi-objective in nature. However, the same multi-objective payoff vector may lead to different utilities for each participant. Therefore, it is essential for an agent to learn about the behaviour of other agents in the system. In this work, we present the first study of the effects of such opponent modelling on multi-objective multi-agent interactions with non-linear utilities. Specifically, we consider two-player multi-objective normal form games with non-linear utility functions under the scalarised expected returns optimisation criterion. We contribute novel actor-critic and policy gradient formulations to allow reinforcement learning of mixed strategies in this setting, along with extensions that incorporate opponent policy reconstruction and learning with opponent learning awareness (i.e., learning while considering the impact of one's policy when anticipating the opponent's learning step). Empirical results in five different MONFGs demonstrate that opponent learning awareness and modelling can drastically alter the learning dynamics in this setting. When equilibria are present, opponent modelling can confer significant benefits on agents that implement it. When there are no Nash equilibria, opponent learning awareness and modelling allow agents to converge to meaningful solutions that approximate equilibria.
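As a rough illustration of the scalarised expected returns (SER) criterion the abstract refers to, the sketch below runs simultaneous gradient ascent on the SER objective for both players of a hypothetical 2x2 MONFG with two objectives and a product utility. The payoff matrix, the utility function, and the finite-difference gradient are illustrative assumptions only; they are not the paper's actual benchmark games or its actor-critic/policy-gradient updates. The key point it demonstrates is that under SER the non-linear utility is applied to the *expected* payoff vector, so mixed strategies can strictly outperform every pure strategy.

```python
import numpy as np

# Hypothetical 2x2 MONFG: each cell holds a 2-objective payoff vector,
# shared by both players (a common MONFG convention). Not from the paper.
PAYOFFS = np.array([[[4.0, 0.0], [3.0, 1.0]],
                    [[1.0, 3.0], [0.0, 4.0]]])  # shape (action1, action2, objective)

def utility(v):
    # Non-linear utility: product of the two objectives.
    return v[0] * v[1]

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def ser(logits1, logits2):
    # Scalarised expected returns: utility of the EXPECTED payoff vector
    # (as opposed to the expectation of the utility of payoff vectors).
    p1, p2 = softmax(logits1), softmax(logits2)
    expected = np.einsum('i,j,ijk->k', p1, p2, PAYOFFS)
    return utility(expected)

def grad(f, x, eps=1e-6):
    # Finite-difference gradient: a stand-in for a proper policy gradient.
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

# Start player 1 away from uniform so there is something to learn.
logits1, logits2 = np.array([1.0, -1.0]), np.zeros(2)
for step in range(2000):  # simultaneous gradient ascent on each player's SER
    g1 = grad(lambda l: ser(l, logits2), logits1)
    g2 = grad(lambda l: ser(logits1, l), logits2)
    logits1 = logits1 + 0.1 * g1
    logits2 = logits2 + 0.1 * g2

print("player 1 strategy:", softmax(logits1))
print("player 2 strategy:", softmax(logits2))
print("final SER value:", ser(logits1, logits2))
```

In this toy game every pure joint action yields a utility of at most 3, while suitable mixed strategies reach 4, which is why the paper's setting requires learning mixed strategies rather than best pure responses.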

Updated: 2020-11-17