Communication Efficient Parallel Reinforcement Learning
arXiv - CS - Multiagent Systems. Pub Date: 2021-02-22, DOI: arxiv-2102.10740
Mridul Agarwal, Bhargav Ganguly, Vaneet Aggarwal

We consider the problem where $M$ agents interact with $M$ identical and independent environments, each with $S$ states and $A$ actions, using reinforcement learning for $T$ rounds. The agents share their data with a central server to minimize their regret. We aim to find an algorithm that allows the agents to minimize the regret with infrequent communication rounds. We provide \NAM, which runs at each agent, and prove that the total cumulative regret of the $M$ agents is upper bounded as $\tilde{O}(DS\sqrt{MAT})$ for a Markov Decision Process with diameter $D$, number of states $S$, and number of actions $A$. The agents synchronize after the number of visits to any state-action pair exceeds a certain threshold. Using this, we obtain a bound of $O\left(MSA\log(MT)\right)$ on the total number of communication rounds. Finally, we evaluate the algorithm on multiple environments and demonstrate that the proposed algorithm performs on par with an always-communicating version of the UCRL2 algorithm, while requiring significantly less communication.
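The following is a minimal, hypothetical sketch of the threshold-based synchronization idea described in the abstract: each agent keeps local visit counts since the last synchronization and requests a communication round once its new visits to some state-action pair cross a threshold. The class names, the doubling-style trigger, and the toy rollout loop below are illustrative assumptions, not the paper's exact algorithm or constants.

```python
import numpy as np

# Toy sizes for illustration only (not from the paper).
S, A, M = 6, 3, 4   # states, actions, agents

class Agent:
    def __init__(self):
        self.synced_counts = np.zeros((S, A))  # counts known to the server
        self.local_counts = np.zeros((S, A))   # new visits since last sync

    def observe(self, s, a):
        self.local_counts[s, a] += 1

    def wants_sync(self):
        # Assumed doubling-style trigger: request a sync once local visits to
        # some (s, a) reach the count recorded at the last synchronization.
        # The paper's exact threshold may differ.
        return np.any(self.local_counts >= np.maximum(self.synced_counts, 1))

def synchronize(agents):
    # Server aggregates every agent's data and broadcasts the totals; each
    # agent would then recompute its optimistic policy from the aggregated
    # counts (policy computation is omitted in this sketch).
    total = sum(ag.synced_counts + ag.local_counts for ag in agents)
    for ag in agents:
        ag.synced_counts = total.copy()
        ag.local_counts = np.zeros((S, A))

agents = [Agent() for _ in range(M)]
rng = np.random.default_rng(0)
sync_rounds = 0
for t in range(2000):
    for ag in agents:
        # Stand-in for a real environment rollout.
        ag.observe(rng.integers(S), rng.integers(A))
    if any(ag.wants_sync() for ag in agents):
        synchronize(agents)
        sync_rounds += 1

# The abstract bounds the number of such rounds by O(MSA log(MT)).
print("communication rounds:", sync_rounds)
```

Because the trigger fires only when counts for some pair roughly double, the number of communication rounds grows logarithmically in the horizon rather than linearly, which is what the stated $O\left(MSA\log(MT)\right)$ bound reflects.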

Updated: 2021-02-23