Decentralized Q-Learning in Zero-sum Markov Games
arXiv - CS - Computer Science and Game Theory. Pub Date: 2021-06-04, DOI: arxiv-2106.02748
Muhammed O. Sayin, Kaiqing Zhang, David S. Leslie, Tamer Basar, Asuman Ozdaglar

We study multi-agent reinforcement learning (MARL) in infinite-horizon discounted zero-sum Markov games. We focus on the practical but challenging setting of decentralized MARL, where agents make decisions without coordination by a centralized controller, based only on their own payoffs and the local actions they execute. The agents need not observe the opponent's actions or payoffs, may even be oblivious to the presence of the opponent, and need not be aware of the zero-sum structure of the underlying game, a setting also referred to as radically uncoupled in the literature on learning in games. In this paper, we develop for the first time a radically uncoupled Q-learning dynamics that is both rational and convergent: the learning dynamics converges to the best response to the opponent's strategy when the opponent follows an asymptotically stationary strategy, and the value function estimates converge to the payoffs at a Nash equilibrium when both agents adopt the dynamics. The key challenge in this decentralized setting is the non-stationarity of the learning environment from each agent's perspective, since both her own payoffs and the system evolution depend on the actions of the other agent, and each agent adapts her policy simultaneously and independently. To address this issue, we develop a two-timescale learning dynamics where each agent updates her local Q-function and value function estimates concurrently, with the latter happening at a slower timescale.
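To make the two-timescale idea concrete, below is a minimal sketch of one radically uncoupled learner, written against a generic zero-sum Markov game environment. The class name, the softmax (smoothed best-response) action rule, the specific step-size schedules, and the environment interface in the usage comment are illustrative assumptions, not the paper's exact specification; the only ingredients taken from the abstract are that each agent sees only her own payoffs and actions, updates a local Q-function on a fast timescale, and updates a value estimate on a slower timescale.

import numpy as np

def softmax(x, tau):
    # Smoothed best response: logit choice over the local Q-values.
    z = (x - x.max()) / tau
    e = np.exp(z)
    return e / e.sum()

class TwoTimescaleAgent:
    """One decentralized learner: observes only its own actions and payoffs."""

    def __init__(self, n_states, n_actions, gamma=0.99, tau=0.05, seed=0):
        self.q = np.zeros((n_states, n_actions))   # local Q-estimates over own actions
        self.v = np.zeros(n_states)                # value estimates, updated more slowly
        self.visits = np.zeros(n_states, dtype=int)
        self.gamma, self.tau = gamma, tau
        self.rng = np.random.default_rng(seed)

    def act(self, s):
        # Sample an action from the smoothed best response to the local Q-function.
        return self.rng.choice(len(self.q[s]), p=softmax(self.q[s], self.tau))

    def update(self, s, a, r, s_next):
        k = self.visits[s] = self.visits[s] + 1
        alpha = 1.0 / k                          # fast timescale: local Q-update
        beta = 1.0 / (k * np.log(k + 1) + 1.0)   # slow timescale: beta/alpha -> 0
        # Fast update: Q tracks payoff plus discounted value estimate at the next state.
        self.q[s, a] += alpha * (r + self.gamma * self.v[s_next] - self.q[s, a])
        # Slow update: value tracks the smoothed best-response value of the local Q-estimates.
        p = softmax(self.q[s], self.tau)
        self.v[s] += beta * (p @ self.q[s] - self.v[s])

A hypothetical driving loop (env, reset, and step are assumed names for an environment that returns the next state and player 1's payoff; player 2 receives the negated payoff, and neither agent is told the other's action):

# agents = [TwoTimescaleAgent(nS, nA1), TwoTimescaleAgent(nS, nA2, seed=1)]
# s = env.reset()
# for t in range(T):
#     a1, a2 = agents[0].act(s), agents[1].act(s)
#     s_next, r = env.step((a1, a2))
#     agents[0].update(s, a1, r, s_next)
#     agents[1].update(s, a2, -r, s_next)
#     s = s_next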

Updated: 2021-06-08