Modeling-Learning-Based Actor-Critic Algorithm with Gaussian Process Approximator,Journal of Grid Computing



Modeling-Learning-Based Actor-Critic Algorithm with Gaussian Process Approximator
Journal of Grid Computing (IF 3.6), Pub Date: 2020-04-18, DOI: 10.1007/s10723-020-09512-4
Shan Zhong, Jack Tan, Husheng Dong, Xuemei Chen, Shengrong Gong, Zhenjiang Qian

Tasks with continuous state and action spaces are difficult to solve with high sample efficiency. Model learning and planning is a well-known way to improve sample efficiency: a model of the system dynamics is learned first and then used for planning. However, if the dynamics model is not captured accurately, the algorithm converges more slowly, which in turn lowers sample efficiency. Therefore, to solve problems with continuous state and action spaces, a model-learning-based actor-critic algorithm with a Gaussian process approximator, named MLAC-GPA, is proposed. The Gaussian process is chosen as the modeling method because it can capture the noise and uncertainty of the underlying system. The model in MLAC-GPA is first represented by linear function approximation and then modeled by the Gaussian process. The expected value vector and the covariance matrix of the model parameters are then estimated by Bayesian inference. Once learned, the model is used for planning, which accelerates the convergence of the value function and the policy. In experiments, MLAC-GPA is implemented and compared with five representative methods on three classic benchmarks: Pole Balancing, Inverted Pendulum, and Mountain Car. The results show that MLAC-GPA outperforms the other methods in both learning rate and sample efficiency.
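The abstract does not give the update equations, so the following is only a minimal sketch of the two ingredients it names, under stated assumptions. It assumes the next state is linear in a hypothetical feature vector phi(s, a) with Gaussian noise and an isotropic Gaussian prior on the weights. Under those assumptions, Bayesian inference yields the posterior mean vector and covariance matrix of the model parameters in closed form; this is the weight-space view of a Gaussian process with a linear kernel. The learned model then drives a Dyna-style planning loop in which a TD(0) critic and a Gaussian-policy actor are updated on simulated transitions. The names `model_features`, `value_features`, `BayesianLinearModel`, and `plan`, the step sizes, and the reward function are all hypothetical illustrations, not the authors' MLAC-GPA.

```python
import numpy as np

def model_features(s, a):
    # Hypothetical model features phi(s, a): bias, raw state, raw action (assumption).
    return np.concatenate(([1.0], np.atleast_1d(s), np.atleast_1d(a)))

def value_features(s):
    # Hypothetical critic/actor features x(s): bias, state, squared state (assumption).
    s = np.atleast_1d(s)
    return np.concatenate(([1.0], s, s ** 2))

class BayesianLinearModel:
    """Assumed linear-Gaussian dynamics s' = M^T phi(s, a) + eps, eps ~ N(0, noise_var * I).

    With a prior N(0, prior_var * I) on each column of M, the posterior is Gaussian
    with a shared covariance matrix Sigma and one mean vector per state dimension.
    """

    def __init__(self, n_features, state_dim, prior_var=10.0, noise_var=0.01):
        self.noise_var = noise_var
        self.precision = np.eye(n_features) / prior_var   # Sigma^{-1}, starts at the prior
        self.b = np.zeros((n_features, state_dim))        # accumulates Phi^T S' / noise_var

    def update(self, s, a, s_next):
        # Conjugate Bayesian update from one observed transition (s, a, s').
        phi = model_features(s, a)
        self.precision += np.outer(phi, phi) / self.noise_var
        self.b += np.outer(phi, np.atleast_1d(s_next)) / self.noise_var

    @property
    def cov(self):
        return np.linalg.inv(self.precision)              # posterior covariance Sigma

    @property
    def mean(self):
        return self.cov @ self.b                          # posterior mean, one column per state dim

    def predict(self, s, a):
        phi = model_features(s, a)
        mu = self.mean.T @ phi                            # predicted next state
        var = self.noise_var + phi @ self.cov @ phi       # predictive variance, same for every dim
        return mu, var

def plan(model, w, theta, reward_fn, start_states, rng, n_steps=10,
         gamma=0.95, alpha_w=0.05, alpha_theta=0.01, sigma=0.5):
    """Dyna-style planning: actor-critic updates on transitions simulated by the model."""
    for s in start_states:
        s = np.atleast_1d(s).astype(float)
        for _ in range(n_steps):
            x = value_features(s)
            mu_a = theta @ x                              # Gaussian policy mean (scalar action)
            a = mu_a + sigma * rng.standard_normal()
            s_next, _ = model.predict(s, a)               # simulated next state (posterior mean)
            delta = reward_fn(s, a) + gamma * w @ value_features(s_next) - w @ x  # TD error
            w = w + alpha_w * delta * x                   # TD(0) critic update
            theta = theta + alpha_theta * delta * (a - mu_a) / sigma ** 2 * x  # actor update
            s = s_next
    return w, theta

# Usage: fit the model to a synthetic 1-D system s' = 0.9 s + 0.5 a + noise, then plan.
rng = np.random.default_rng(0)
model = BayesianLinearModel(n_features=3, state_dim=1)
for _ in range(200):
    s, a = rng.uniform(-1, 1, size=1), rng.uniform(-1, 1)
    model.update(s, a, 0.9 * s + 0.5 * a + 0.01 * rng.standard_normal(1))
print(model.mean[:, 0])  # should be close to the true coefficients [0.0, 0.9, 0.5]
w, theta = plan(model, np.zeros(3), np.zeros(3), lambda s, a: -float(s @ s),
                [np.array([0.5]), np.array([-0.5])], rng, n_steps=5)
```

Using the posterior mean as the simulated next state keeps the sketch deterministic given the action. A variant closer in spirit to exploiting model uncertainty would sample the next state from the predictive distribution `N(mu, var)` returned by `predict`.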


Updated: 2020-04-18
