Reinforcement Learning in Non-Stationary Discrete-Time Linear-Quadratic Mean-Field Games
arXiv - CS - Computer Science and Game Theory. Pub Date: 2020-09-09, DOI: arxiv-2009.04350
Muhammad Aneeq uz Zaman, Kaiqing Zhang, Erik Miehling, and Tamer Başar

In this paper, we study large-population multi-agent reinforcement learning (RL) in the context of discrete-time linear-quadratic mean-field games (LQ-MFGs). Our setting differs from most existing work on RL for MFGs in that we consider a non-stationary MFG over an infinite horizon. We propose an actor-critic algorithm to iteratively compute the mean-field equilibrium (MFE) of the LQ-MFG. There are two primary challenges: i) the non-stationarity of the MFG induces a linear-quadratic tracking problem, which requires solving a backwards-in-time (non-causal) equation that cannot be handled by standard (causal) RL algorithms; ii) many RL algorithms assume that states are sampled from the stationary distribution of a Markov chain (MC), i.e., that the chain has already mixed, an assumption that is not satisfied for real data sources. We first identify that the mean-field trajectory follows linear dynamics, allowing the problem to be reformulated as a linear-quadratic Gaussian problem. Under this reformulation, we propose an actor-critic algorithm that allows samples to be drawn from an unmixed MC. Finite-sample convergence guarantees for the algorithm are then provided. To characterize the performance of our algorithm in multi-agent RL, we derive an error bound with respect to the Nash equilibrium of the finite-population game.
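To make the structure behind these statements concrete, the following is a minimal LaTeX sketch of a generic discrete-time LQ-MFG and of why the mean-field trajectory is linear under linear feedback policies. The notation (A, B, Q, R, K, L) is assumed for illustration and is not necessarily the exact model used in the paper.

% A generic discrete-time LQ-MFG (assumed notation; not necessarily the paper's exact model).
% Agent i's linear dynamics and quadratic cost, coupled through the mean-field trajectory \bar{x}_t:
\begin{align}
  x_{i,t+1} &= A\,x_{i,t} + B\,u_{i,t} + w_{i,t}, \qquad w_{i,t} \sim \mathcal{N}(0,\Sigma), \\
  J_i &= \limsup_{T \to \infty} \frac{1}{T}\,\mathbb{E}\sum_{t=0}^{T-1}
        \Big( \|x_{i,t} - \bar{x}_t\|_{Q}^{2} + \|u_{i,t}\|_{R}^{2} \Big),
        \qquad \|v\|_{M}^{2} := v^{\top} M v .
\end{align}
% If every agent applies the same linear feedback policy u_{i,t} = -K x_{i,t} - L \bar{x}_t,
% then averaging over the population (law of large numbers) removes the noise and yields
\begin{equation}
  \bar{x}_{t+1} = (A - BK - BL)\,\bar{x}_t ,
\end{equation}
% i.e., the mean-field trajectory itself follows linear dynamics. A single agent's best response
% is then a linear-quadratic tracking problem against the non-stationary sequence \{\bar{x}_t\},
% which can be recast as a linear-quadratic Gaussian control problem.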

Updated: 2020-10-02