Combining a gradient-based method and an evolution strategy for multi-objective reinforcement learning
Applied Intelligence (IF 5.3), Pub Date: 2020-06-01, DOI: 10.1007/s10489-020-01702-7
Diqi Chen , Yizhou Wang , Wen Gao

Multi-objective reinforcement learning (MORL) algorithms aim to approximate the Pareto frontier uniformly in multi-objective decision-making problems. In deep reinforcement learning (RL), gradient-based methods are often adopted to learn deep policies/value functions because of their fast convergence, but pure gradient-based methods cannot guarantee a uniform approximation of the Pareto frontier. On the other hand, evolution strategies operate directly in the solution space and can achieve a well-distributed Pareto frontier, yet applying them to optimize deep networks remains challenging. To leverage the advantages of both kinds of methods, we propose a two-stage MORL framework that combines a gradient-based method with an evolution strategy. In the first stage, an efficient multi-policy soft actor-critic algorithm learns multiple policies collaboratively, with the lower layers of all policy networks shared; this stage can be regarded as representation learning. In the second stage, the multi-objective covariance matrix adaptation evolution strategy (MO-CMA-ES) fine-tunes policy-independent parameters to approach a dense and uniform estimation of the Pareto frontier. Experimental results on three benchmarks (Deep Sea Treasure, Adaptive Streaming, and Super Mario Bros) show the superiority of the proposed method.
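The two-stage framework described above suggests a shared-backbone, multi-head network: the shared lower layers are trained jointly with the multi-policy soft actor-critic algorithm in the first stage, and selected parameters are then fine-tuned with MO-CMA-ES in the second. Below is a minimal, hypothetical PyTorch sketch of such an architecture; it is not the authors' code, and names such as SharedMultiPolicy, obs_dim, act_dim, and n_policies are illustrative assumptions rather than identifiers from the paper.

```python
import torch
import torch.nn as nn

class SharedMultiPolicy(nn.Module):
    """Sketch of a multi-policy network with shared lower layers (stage 1)."""

    def __init__(self, obs_dim, act_dim, n_policies, hidden=256):
        super().__init__()
        # Lower layers shared by all policies; training them jointly is the
        # representation-learning stage described in the abstract.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One Gaussian policy head per policy on the approximated Pareto frontier.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, 2 * act_dim) for _ in range(n_policies)]
        )

    def forward(self, obs, policy_idx):
        # Shared features, then the selected policy head's action distribution.
        feat = self.trunk(obs)
        mean, log_std = self.heads[policy_idx](feat).chunk(2, dim=-1)
        std = log_std.clamp(-5.0, 2.0).exp()
        return torch.distributions.Normal(mean, std)


def flat_params(module):
    # Flatten a sub-network's parameters into a single vector; in the second
    # stage, vectors like this would be perturbed and selected by an
    # MO-CMA-ES style search.
    return torch.cat([p.detach().flatten() for p in module.parameters()])
```

In this reading, each head corresponds to one policy along the approximated Pareto frontier, and a helper like flat_params would supply the parameter vectors that the evolutionary fine-tuning stage perturbs and selects among.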



Updated: 2020-06-01