Gradient-free Online Learning in Games with Delayed Rewards, arXiv - CS - Computer Science and Game Theory

Gradient-free Online Learning in Games with Delayed Rewards
arXiv - CS - Computer Science and Game Theory, Pub Date: 2020-06-19, DOI: arxiv-2006.10911
Amélie Héliou, Panayotis Mertikopoulos, and Zhengyuan Zhou

Motivated by applications to online advertising and recommender systems, we consider a game-theoretic model with delayed rewards and asynchronous, payoff-based feedback. In contrast to previous work on delayed multi-armed bandits, we focus on multi-player games with continuous action spaces, and we examine the long-run behavior of strategic agents that follow a no-regret learning policy (but are otherwise oblivious to the game being played, the objectives of their opponents, etc.). To account for the lack of a consistent stream of information (for instance, rewards can arrive out of order, with an a priori unbounded delay, etc.), we introduce a gradient-free learning policy where payoff information is placed in a priority queue as it arrives. In this general context, we derive new bounds for the agents' regret; furthermore, under a standard diagonal concavity assumption, we show that the induced sequence of play converges to Nash equilibrium with probability $1$, even if the delay between choosing an action and receiving the corresponding reward is unbounded.
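
To make the priority-queue idea concrete, here is a minimal single-agent sketch in Python. It is not the authors' algorithm: the one-point gradient estimate, the step-size and query-radius schedules, the ball-shaped action set, and every name (`learn_with_delayed_payoffs`, `in_transit`, `ready`) are assumptions chosen for illustration. The agent only observes payoff values, which arrive late and out of order; arrived payoffs are queued by the round in which the action was played, and the oldest one drives each update.

```python
import heapq
import math
import random


def project_to_ball(x, radius):
    """Euclidean projection of x onto the origin-centered ball of given radius."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def random_unit_vector(dim, rng):
    """Uniform direction on the unit sphere (normalized Gaussian sample)."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in g))
        if norm > 0.0:
            return [v / norm for v in g]


def learn_with_delayed_payoffs(payoff, dim, rounds, max_delay, seed=0):
    """Payoff-based learning with delayed feedback (illustrative assumptions only).

    The action set is assumed to be the unit ball. The pivot is kept in a
    ball of radius 0.5 so that every perturbed action stays feasible.
    """
    rng = random.Random(seed)
    pivot = [0.0] * dim
    in_transit = []  # simulated environment: heap keyed by arrival round
    ready = []       # priority queue keyed by the round the action was played

    for t in range(1, rounds + 1):
        delta = 0.25 / t ** 0.25  # query radius (assumed schedule)
        step = 0.1 / t ** 0.75    # step size (assumed schedule)

        # Play a randomly perturbed action; only its payoff is ever observed.
        w = random_unit_vector(dim, rng)
        action = [p + delta * wi for p, wi in zip(pivot, w)]
        reward = payoff(action)

        # The payoff reaches the agent after a random delay.
        arrival = t + rng.randint(0, max_delay)
        heapq.heappush(in_transit, (arrival, t, reward, delta, w))

        # Everything that has arrived by now enters the priority queue.
        while in_transit and in_transit[0][0] <= t:
            _, played, r, d_used, w_used = heapq.heappop(in_transit)
            heapq.heappush(ready, (played, r, d_used, w_used))

        # Use the oldest available payoff for a one-point gradient estimate.
        if ready:
            _, r, d_used, w_used = heapq.heappop(ready)
            scale = step * dim * r / d_used
            moved = [p + scale * wi for p, wi in zip(pivot, w_used)]
            pivot = project_to_ball(moved, 0.5)

    return pivot
```

As a usage example, calling `learn_with_delayed_payoffs(lambda a: -sum((ai - ci) ** 2 for ai, ci in zip(a, [0.3, -0.2])), dim=2, rounds=5000, max_delay=5)` runs the sketch on a concave quadratic payoff whose maximizer lies inside the pivot ball. The multi-player setting of the paper would run one such learner per player, with each payoff depending on all players' actions.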

Updated: 2020-06-22
