Combining a gradient-based method and an evolution strategy for multi-objective reinforcement learning
Applied Intelligence (IF 5.3), Pub Date: 2020-06-01, DOI: 10.1007/s10489-020-01702-7
Diqi Chen , Yizhou Wang , Wen Gao

Multi-objective reinforcement learning (MORL) algorithms aim to approximate the Pareto frontier uniformly in multi-objective decision-making problems. In deep reinforcement learning (RL), gradient-based methods are often adopted to learn deep policies/value functions because of their fast convergence, but pure gradient-based methods cannot guarantee a uniform approximation of the Pareto frontier. On the other hand, evolution strategies operate directly in the solution space and can achieve a well-distributed Pareto frontier, yet applying them to optimize deep networks remains challenging. To leverage the advantages of both kinds of methods, we propose a two-stage MORL framework that combines a gradient-based method with an evolution strategy. In the first stage, an efficient multi-policy soft actor-critic algorithm learns multiple policies collaboratively, with the lower layers of all policy networks shared; this stage can be regarded as representation learning. In the second stage, the multi-objective covariance matrix adaptation evolution strategy (MO-CMA-ES) fine-tunes policy-independent parameters to approach a dense and uniform estimation of the Pareto frontier. Experimental results on three benchmarks (Deep Sea Treasure, Adaptive Streaming, and Super Mario Bros) show the superiority of the proposed method.
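The two-stage framework described above suggests a shared-backbone, multi-head network: the shared lower layers are trained jointly with the multi-policy soft actor-critic algorithm in the first stage, and selected parameters are then fine-tuned with MO-CMA-ES in the second. Below is a minimal, hypothetical PyTorch sketch of such an architecture; it is not the authors' code, and names such as SharedMultiPolicy, obs_dim, act_dim, and n_policies are illustrative assumptions rather than identifiers from the paper.

```python
import torch
import torch.nn as nn

class SharedMultiPolicy(nn.Module):
    """Sketch of a multi-policy network with shared lower layers (stage 1)."""

    def __init__(self, obs_dim, act_dim, n_policies, hidden=256):
        super().__init__()
        # Lower layers shared by all policies; training them jointly is the
        # representation-learning stage described in the abstract.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One Gaussian policy head per policy on the approximated Pareto frontier.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, 2 * act_dim) for _ in range(n_policies)]
        )

    def forward(self, obs, policy_idx):
        # Shared features, then the selected policy head's action distribution.
        feat = self.trunk(obs)
        mean, log_std = self.heads[policy_idx](feat).chunk(2, dim=-1)
        std = log_std.clamp(-5.0, 2.0).exp()
        return torch.distributions.Normal(mean, std)


def flat_params(module):
    # Flatten a sub-network's parameters into a single vector; in the second
    # stage, vectors like this would be perturbed and selected by an
    # MO-CMA-ES style search.
    return torch.cat([p.detach().flatten() for p in module.parameters()])
```

In this reading, each head corresponds to one policy along the approximated Pareto frontier, and a helper like flat_params would supply the parameter vectors that the evolutionary fine-tuning stage perturbs and selects among.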



Updated: 2020-06-01