Multi-Task Reinforcement Learning in Reproducing Kernel Hilbert Spaces via Cross-Learning
IEEE Transactions on Signal Processing (IF 4.6) Pub Date: 2021-10-26, DOI: 10.1109/tsp.2021.3122303
Juan Cervino, Juan Andres Bazerque, Miguel Calvo-Fullana, Alejandro R. Ribeiro

Reinforcement learning is a framework for optimizing an agent's policy using rewards that the system reveals in response to its actions. In its standard form, reinforcement learning involves a single agent that uses its policy to accomplish a specific task. These methods require large amounts of reward samples to achieve good performance and may not generalize well when the task is modified, even if the new task is related. In this paper we are interested in a collaborative scheme in which multiple policies are optimized jointly. To this end, we introduce cross-learning, in which policies are trained for related tasks in separate environments and constrained to be close to one another. Two properties make our new approach attractive: (i) it produces a multi-task central policy that can be used as a starting point for adapting quickly to any of the tasks it was trained for, and (ii) as in meta-learning, it adapts to environments related to, but different from, those seen during training. We focus on policies belonging to reproducing kernel Hilbert spaces, for which we bound the distance between the task-specific policies and the cross-learned policy. To solve the resulting optimization problem, we resort to a projected policy gradient algorithm and prove that it converges to a near-optimal solution with high probability. We evaluate our methodology on a navigation example in which an agent moves through environments with obstacles of multiple shapes and avoids obstacles it was not trained for.
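One plausible way to read the constrained scheme sketched in the abstract is as the joint problem below; the notation ($J_i$, $\pi_i$, $\bar{\pi}$, $\epsilon$, and the RKHS $\mathcal{H}$) is illustrative and not taken verbatim from the paper:

\[
\max_{\pi_1,\dots,\pi_N,\;\bar{\pi} \in \mathcal{H}} \;\; \sum_{i=1}^{N} J_i(\pi_i)
\quad \text{subject to} \quad \lVert \pi_i - \bar{\pi} \rVert_{\mathcal{H}} \le \epsilon, \quad i = 1,\dots,N,
\]

where $J_i$ denotes the expected cumulative reward of task $i$ and $\bar{\pi}$ is the central (cross-learned) policy. Under this reading, a projected policy gradient method ascends each $J_i$ and then projects the iterates back onto the set of policies within distance $\epsilon$ of the central policy, consistent with the high-probability convergence guarantee stated above.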

Updated: 2021-10-26