Gradient-free Online Learning in Games with Delayed Rewards
arXiv - CS - Computer Science and Game Theory. Pub Date: 2020-06-19, DOI: arxiv-2006.10911
Amélie Héliou, Panayotis Mertikopoulos, and Zhengyuan Zhou

Motivated by applications to online advertising and recommender systems, we consider a game-theoretic model with delayed rewards and asynchronous, payoff-based feedback. In contrast to previous work on delayed multi-armed bandits, we focus on multi-player games with continuous action spaces, and we examine the long-run behavior of strategic agents that follow a no-regret learning policy (but are otherwise oblivious to the game being played, the objectives of their opponents, etc.). To account for the lack of a consistent stream of information (for instance, rewards can arrive out of order, with an a priori unbounded delay, etc.), we introduce a gradient-free learning policy where payoff information is placed in a priority queue as it arrives. In this general context, we derive new bounds for the agents' regret; furthermore, under a standard diagonal concavity assumption, we show that the induced sequence of play converges to Nash equilibrium with probability $1$, even if the delay between choosing an action and receiving the corresponding reward is unbounded.
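
The abstract gives no pseudocode, so the following is only a minimal sketch of the general idea, not the authors' algorithm: each player perturbs its action, observes a payoff that arrives after a random delay, stores received feedback in a priority queue keyed by the round in which the action was played, and uses the oldest available item to form a one-point (gradient-free) estimate. Everything here is an assumption made for illustration: the two-player quadratic game, the names `FeedbackQueue`, `single_point_estimate` and `simulate`, the step-size and perturbation schedules, and the bounded delay `max_delay` (the paper allows unbounded delays).

```python
import heapq
import random


def single_point_estimate(payoff, direction, delta, dim=1):
    """One-point gradient estimate built from a single observed payoff."""
    return (dim / delta) * payoff * direction


class FeedbackQueue:
    """Min-heap of received feedback, keyed by the round the action was played."""

    def __init__(self):
        self._heap = []

    def push(self, played_at, payoff, direction, delta):
        heapq.heappush(self._heap, (played_at, payoff, direction, delta))

    def pop_oldest(self):
        """Return the feedback with the smallest play round, or None if empty."""
        return heapq.heappop(self._heap) if self._heap else None

    def __len__(self):
        return len(self._heap)


def project(value, lo, hi):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return min(hi, max(lo, value))


def simulate(rounds=2000, max_delay=5, seed=0):
    """Two players with actions in [0, 1]; all game parameters are hypothetical."""
    rng = random.Random(seed)
    targets, coupling = (0.3, 0.7), 0.2  # illustrative values, not from the paper

    def payoff(i, actions):
        # Concave in the player's own action, weakly coupled to the opponent's.
        return -(actions[i] - targets[i]) ** 2 - coupling * actions[i] * actions[1 - i]

    pivots = [0.5, 0.5]                  # unperturbed base actions
    queues = [FeedbackQueue(), FeedbackQueue()]
    in_flight = []                       # payoffs that have not arrived yet
    history = []
    for t in range(1, rounds + 1):
        step = 0.5 / t                   # assumed step-size schedule
        delta = min(0.25, t ** -0.25)    # assumed, non-increasing perturbation radius
        directions = [rng.choice((-1.0, 1.0)) for _ in pivots]
        # Clamping only guards against floating-point rounding at the boundary.
        actions = [project(p + delta * d, 0.0, 1.0) for p, d in zip(pivots, directions)]
        history.append(tuple(actions))
        for i in range(2):
            arrival = t + rng.randint(0, max_delay)  # rewards may arrive out of order
            heapq.heappush(in_flight, (arrival, i, t, payoff(i, actions), directions[i], delta))
        # Deliver everything that has arrived by round t to its owner's queue.
        while in_flight and in_flight[0][0] <= t:
            _, i, played_at, value, direction, used_delta = heapq.heappop(in_flight)
            queues[i].push(played_at, value, direction, used_delta)
        # Each player consumes at most one item per round: the oldest one it holds.
        for i in range(2):
            item = queues[i].pop_oldest()
            if item is None:
                continue                 # no feedback available, keep the pivot
            _, value, direction, used_delta = item
            grad = single_point_estimate(value, direction, used_delta)
            # Keep the pivot far enough inside [0, 1] that perturbed play stays feasible.
            pivots[i] = project(pivots[i] + step * grad, delta, 1.0 - delta)
    return pivots, history


if __name__ == "__main__":
    final_pivots, _ = simulate()
    print(final_pivots)
```

Keying the heap by play round rather than arrival round means that feedback arriving out of order is still consumed oldest-first, which is one plausible reading of "placed in a priority queue as it arrives". The sketch makes no claim about regret or convergence; those guarantees are the paper's results and depend on its actual step-size and delay assumptions.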

Updated: 2020-06-22