Fully distributed actor-critic architecture for multitask deep reinforcement learning
The Knowledge Engineering Review (IF 2.1), Pub Date: 2021-04-16, DOI: 10.1017/s0269888921000023
Sergio Valcarcel Macua, Ian Davies, Aleksi Tukiainen, Enrique Munoz de Cote

We propose a fully distributed actor-critic architecture, named diffusion-distributed-actor-critic (Diff-DAC), with application to multitask reinforcement learning (MRL). During the learning process, agents communicate their value and policy parameters to their neighbours, diffusing the information across a network of agents with no need for a central station. Each agent can only access data from its local task, but aims to learn a common policy that performs well for the whole set of tasks. The architecture is scalable, since the computational and communication cost per agent depends on the number of neighbours rather than the overall number of agents. We derive Diff-DAC from duality theory and provide novel insights into the actor-critic framework, showing that it is actually an instance of the dual-ascent method. We prove almost sure convergence of Diff-DAC to a common policy under general assumptions that hold even for deep neural network approximations. For more restrictive assumptions, we also prove that this common policy is a stationary point of an approximation of the original problem. Numerical results on multitask extensions of common continuous control benchmarks demonstrate that Diff-DAC stabilises learning and has a regularising effect that induces higher performance and better generalisation properties than previous architectures.
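To make the diffusion idea concrete, the sketch below shows a generic adapt-then-combine update in which each agent takes a local gradient step on its own task and then averages its actor and critic parameters with those of its neighbours over a ring network. This is an illustrative simplification, not the Diff-DAC algorithm from the paper: the ring topology, the row-stochastic combination weights, and the placeholder `local_gradients` function are assumptions for exposition only.

```python
# Minimal sketch of diffusion-style parameter averaging among networked agents.
# Hypothetical names and a linear parameterisation stand in for the actual
# actor/critic networks and task-specific gradients used by Diff-DAC.
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS = 4   # one agent per task
DIM = 8        # parameter dimension (stands in for network weights)
STEP = 0.05    # local learning rate

# Ring topology: each agent combines with itself and its two neighbours.
adjacency = np.zeros((N_AGENTS, N_AGENTS))
for i in range(N_AGENTS):
    adjacency[i, i] = 1.0
    adjacency[i, (i - 1) % N_AGENTS] = 1.0
    adjacency[i, (i + 1) % N_AGENTS] = 1.0
# Row-stochastic combination weights (uniform over each neighbourhood).
combine = adjacency / adjacency.sum(axis=1, keepdims=True)

policy = rng.normal(size=(N_AGENTS, DIM))  # actor parameters, one row per agent
value = rng.normal(size=(N_AGENTS, DIM))   # critic parameters, one row per agent

def local_gradients(agent_idx, policy_params, value_params):
    """Placeholder for gradients each agent computes from its own task's data.

    Pulls parameters toward a task-specific target to mimic heterogeneous
    local objectives; a real agent would use sampled trajectories instead.
    """
    target = np.full(DIM, fill_value=float(agent_idx))
    return target - policy_params, target - value_params

for _ in range(200):
    # 1) Adaptation: each agent takes a local actor-critic-style gradient step.
    for i in range(N_AGENTS):
        g_pi, g_v = local_gradients(i, policy[i], value[i])
        policy[i] += STEP * g_pi
        value[i] += STEP * g_v
    # 2) Combination (diffusion): average parameters over each neighbourhood,
    #    spreading information across the network without a central station.
    policy = combine @ policy
    value = combine @ value

# Agents end up near a common policy; residual disagreement shrinks with STEP.
disagreement = np.abs(policy - policy.mean(axis=0)).max()
print(f"max disagreement across agents after diffusion: {disagreement:.3f}")
```

The adapt-then-combine pattern keeps the per-agent cost proportional to the neighbourhood size, which is what makes the architecture scalable in the number of agents.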

Updated: 2021-04-16