Modeling-Learning-Based Actor-Critic Algorithm with Gaussian Process Approximator

Journal of Grid Computing

Abstract

Tasks with continuous state and action spaces are difficult to solve with high sample efficiency. Model learning and planning is a well-known way to improve sample efficiency: a model of the system dynamics is learned first and then used for planning. However, if the dynamics model is not captured accurately, convergence slows down and sample efficiency remains low. Therefore, to solve problems with continuous state and action spaces, a model-learning-based actor-critic algorithm with a Gaussian process approximator, named MLAC-GPA, is proposed, where the Gaussian process is chosen as the modeling method because it captures the noise and uncertainty of the underlying system. The model in MLAC-GPA is first represented by linear function approximation and then modeled by the Gaussian process. Afterward, the expectation vector and the covariance matrix of the model parameter are estimated by Bayesian inference. Once learned, the model is used for planning to accelerate the convergence of the value function and the policy. Experimentally, MLAC-GPA is implemented and compared with five representative methods on three classic benchmarks: Pole Balancing, Inverted Pendulum, and Mountain Car. The results show that MLAC-GPA outperforms the others in both learning rate and sample efficiency.
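
A minimal sketch of the modeling step described above, assuming radial-basis features over state-action pairs and a unit Gaussian prior on the model parameter; the names (rbf_features, BayesianDynamicsModel) and the update rule shown here are illustrative, not the paper's implementation.

```python
# Minimal sketch, assuming RBF features phi(s, a) and a unit Gaussian prior
# over the model parameter; all names here are illustrative.
import numpy as np

def rbf_features(state, action, centers, width=1.0):
    """Hypothetical feature map phi(s, a) built from fixed RBF centers."""
    z = np.concatenate([np.atleast_1d(state), np.atleast_1d(action)])
    return np.exp(-np.sum((centers - z) ** 2, axis=1) / (2.0 * width ** 2))

class BayesianDynamicsModel:
    """Linear-in-features dynamics model with a Gaussian posterior over its
    parameter, so predictions carry both a mean and an uncertainty estimate."""

    def __init__(self, n_features, noise_var=1e-2):
        self.beta = np.zeros(n_features)   # posterior mean of the parameter
        self.P = np.eye(n_features)        # posterior covariance of the parameter
        self.noise_var = noise_var         # observation noise sigma^2

    def update(self, phi, target):
        """Bayesian update for one observed transition (see the Appendix)."""
        p = self.P @ phi
        s = self.noise_var + phi @ p
        d = target - phi @ self.beta
        self.beta = self.beta + (p / s) * d
        self.P = self.P - np.outer(p, p) / s

    def predict(self, phi):
        """Predictive mean and variance of one next-state component."""
        mean = phi @ self.beta
        var = self.noise_var + phi @ self.P @ phi
        return mean, var
```

In this reading, one such model is kept per state dimension; during planning, imagined transitions drawn from predict are fed to the same critic and actor updates as real samples, which is what accelerates convergence.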


Acknowledgments

This work is supported by the National Natural Science Foundation of China (61702055, U1764261, 51705021, 61972059, 61773272), the Natural Science Foundation of Jiangsu (BK2012616), the Science and Technology Program of Jiangsu (BK2015260), the Natural Science Foundation of the Higher Education Institutions of Jiangsu (13KJB520020), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18), and the Suzhou Industrial Application of Basic Research Program (SYG201422, SYG201308).

Corresponding authors

Correspondence to Shan Zhong or Xuemei Chen.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Recursive inference of the model parameter

Recall the expressions for the parameter posterior: \(\hat{\boldsymbol{\beta}}_{t} = \boldsymbol{\Phi}_{t} \boldsymbol{Q}_{t} \boldsymbol{x}_{t+1}\), \(\hat{\mathbf{P}}_{t} = \mathbf{I} - \boldsymbol{\Phi}_{t} \boldsymbol{Q}_{t} \boldsymbol{\Phi}_{t}^{T}\), and \(\boldsymbol{Q}_{t} = \left( \boldsymbol{\Phi}_{t}^{T} \boldsymbol{\Phi}_{t} + \boldsymbol{\Sigma}_{t} \right)^{-1}\).
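
As a concrete reading of these expressions, the following sketch computes the posterior directly in this dual form; the assumed shapes (feature vectors stored as the columns of \(\boldsymbol{\Phi}_{t}\), targets stacked as the vector \(\boldsymbol{x}_{t+1}\)) are an interpretation, not stated in the text.

```python
import numpy as np

def batch_posterior(Phi, x, noise_var):
    """Direct evaluation of the stated posterior (dual form, unit prior).

    Phi       : (n, t) array, columns are the feature vectors phi_1..phi_t
    x         : (t,)   array, targets stacked as the vector x_{t+1}
    noise_var : (t,)   array, per-sample noise variances sigma_i^2
    """
    Q = np.linalg.inv(Phi.T @ Phi + np.diag(noise_var))   # Q_t
    beta_hat = Phi @ Q @ x                                # posterior mean of the parameter
    P_hat = np.eye(Phi.shape[0]) - Phi @ Q @ Phi.T        # posterior covariance of the parameter
    return beta_hat, P_hat, Q
```

This is the kernel-style form of Bayesian linear regression with a unit prior on the parameter; the recursions below avoid re-inverting the growing \(t \times t\) matrix at every step.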

The matrices \(\boldsymbol{\Phi}_{t}\), \(\boldsymbol{\Sigma}_{t}\), and \(\boldsymbol{Q}_{t}^{-1}\) can be written recursively as follows:

$$ \boldsymbol{\Phi}_{t} = [\boldsymbol{\Phi}_{t-1}, \boldsymbol{\phi}_{t}], $$
(1)
$$ \boldsymbol{\Sigma}_{t} = \left[ \begin{array}{cc} \boldsymbol{\Sigma}_{t-1} & 0 \\ 0 & \sigma_{t}^{2} \end{array} \right], $$
(2)
$$ \boldsymbol{Q}_{t}^{-1} = \left[ \begin{array}{cc} \boldsymbol{Q}_{t-1}^{-1} & \boldsymbol{\Phi}_{t-1}^{T} \boldsymbol{\phi}_{t} \\ \boldsymbol{\phi}_{t}^{T} \boldsymbol{\Phi}_{t-1} & \boldsymbol{\phi}_{t}^{T} \boldsymbol{\phi}_{t} + \sigma_{t}^{2} \end{array} \right]. $$
(3)

By the block matrix inversion formula, the inverse of \(\boldsymbol{Q}_{t}^{-1}\) can be written as

$$ \boldsymbol{Q}_{t} = \frac{1}{s_{t}} \left[ \begin{array}{cc} s_{t} \boldsymbol{Q}_{t-1} + \boldsymbol{g}_{t} \boldsymbol{g}_{t}^{T} & -\boldsymbol{g}_{t} \\ -\boldsymbol{g}_{t}^{T} & 1 \end{array} \right], $$
(4)

where

$$ \boldsymbol{g}_{t} = \boldsymbol{Q}_{t-1} \boldsymbol{\Phi}_{t-1}^{T} \boldsymbol{\phi}_{t}, $$
(5)
$$ s_{t} = \sigma_{t}^{2} + \boldsymbol{\phi}_{t}^{T} \boldsymbol{\phi}_{t} - \left( \boldsymbol{\Phi}_{t-1}^{T} \boldsymbol{\phi}_{t} \right)^{T} \boldsymbol{g}_{t}. $$
(6)
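
The following numeric check, a sketch with arbitrary random data and an assumed constant noise level, confirms that Eqs. (4)-(6) reproduce the direct inverse of the bordered matrix in Eq. (3):

```python
# Check of Eq. (4): recover Q_t from Q_{t-1} via the block inversion formula.
import numpy as np

rng = np.random.default_rng(0)
n, t = 4, 6                              # feature dimension, previous sample count
Phi_prev = rng.normal(size=(n, t))       # Phi_{t-1}, columns are feature vectors
phi = rng.normal(size=n)                 # new feature vector phi_t
sigma2 = 0.1                             # sigma_t^2
Sigma_prev = 0.1 * np.eye(t)             # Sigma_{t-1}

Q_prev = np.linalg.inv(Phi_prev.T @ Phi_prev + Sigma_prev)

g = Q_prev @ Phi_prev.T @ phi                            # Eq. (5)
s = sigma2 + phi @ phi - (Phi_prev.T @ phi) @ g          # Eq. (6)

Q_block = np.block([
    [s * Q_prev + np.outer(g, g), -g[:, None]],
    [-g[None, :],                 np.ones((1, 1))],
]) / s                                                   # Eq. (4)

# Direct inversion of the bordered matrix in Eq. (3) for comparison
Phi_new = np.hstack([Phi_prev, phi[:, None]])
Sigma_new = np.block([[Sigma_prev,        np.zeros((t, 1))],
                      [np.zeros((1, t)),  sigma2 * np.ones((1, 1))]])
Q_direct = np.linalg.inv(Phi_new.T @ Phi_new + Sigma_new)

assert np.allclose(Q_block, Q_direct)
```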

The expression for \(\hat{\boldsymbol{\beta}}_{t}\) can then be written as

$$ \begin{array}{rl} \hat{\boldsymbol{\beta}}_{t} &= \boldsymbol{\Phi}_{t} \boldsymbol{Q}_{t} \boldsymbol{x}_{t+1} \\ &= \frac{1}{s_{t}} \left[ \boldsymbol{\Phi}_{t-1}, \boldsymbol{\phi}_{t} \right] \left[ \begin{array}{cc} s_{t} \boldsymbol{Q}_{t-1} + \boldsymbol{g}_{t} \boldsymbol{g}_{t}^{T} & -\boldsymbol{g}_{t} \\ -\boldsymbol{g}_{t}^{T} & 1 \end{array} \right] \left( \begin{array}{c} \boldsymbol{x}_{t} \\ x_{t+1} \end{array} \right) \\ &= \boldsymbol{\Phi}_{t-1} \boldsymbol{Q}_{t-1} \boldsymbol{x}_{t} + \frac{1}{s_{t}} \left[ \boldsymbol{\Phi}_{t-1}, \boldsymbol{\phi}_{t} \right] \left( \begin{array}{c} \boldsymbol{g}_{t} \\ -1 \end{array} \right) \left( \boldsymbol{g}_{t}^{T}, -1 \right) \left( \begin{array}{c} \boldsymbol{x}_{t} \\ x_{t+1} \end{array} \right) \\ &= \hat{\boldsymbol{\beta}}_{t-1} + \frac{1}{s_{t}} \left( \boldsymbol{\Phi}_{t-1} \boldsymbol{g}_{t} - \boldsymbol{\phi}_{t} \right) \left( \boldsymbol{g}_{t}^{T} \boldsymbol{x}_{t} - x_{t+1} \right) \\ &= \hat{\boldsymbol{\beta}}_{t-1} + \frac{\boldsymbol{p}_{t}}{s_{t}} d_{t}, \end{array} $$
(7)

where \(d_{t} = x_{t+1} - \boldsymbol{g}_{t}^{T} \boldsymbol{x}_{t}\) and \(\boldsymbol{p}_{t} = \boldsymbol{\phi}_{t} - \boldsymbol{\Phi}_{t-1} \boldsymbol{g}_{t}\).

Next, simpler computations for \(d_{t}\), \(\boldsymbol{p}_{t}\), and \(s_{t}\) are derived. Let us begin with \(d_{t}\):

$$ \begin{array}{rl} d_{t} &= x_{t+1} - \boldsymbol{g}_{t}^{T} \boldsymbol{x}_{t} \\ &= x_{t+1} - \boldsymbol{\phi}_{t}^{T} \boldsymbol{\Phi}_{t-1} \boldsymbol{Q}_{t-1} \boldsymbol{x}_{t} \\ &= x_{t+1} - \boldsymbol{\phi}_{t}^{T} \hat{\boldsymbol{\beta}}_{t-1}. \end{array} $$
(8)
$$ \begin{array}{rl} \boldsymbol{p}_{t} &= \boldsymbol{\phi}_{t} - \boldsymbol{\Phi}_{t-1} \boldsymbol{g}_{t} \\ &= \boldsymbol{\phi}_{t} - \boldsymbol{\Phi}_{t-1} \boldsymbol{Q}_{t-1} \boldsymbol{\Phi}_{t-1}^{T} \boldsymbol{\phi}_{t} \\ &= \left( \mathbf{I} - \boldsymbol{\Phi}_{t-1} \boldsymbol{Q}_{t-1} \boldsymbol{\Phi}_{t-1}^{T} \right) \boldsymbol{\phi}_{t} \\ &= \hat{\mathbf{P}}_{t-1} \boldsymbol{\phi}_{t}. \end{array} $$
(9)
$$ \begin{array}{rl} s_{t} &= \sigma_{t}^{2} + \boldsymbol{\phi}_{t}^{T} \boldsymbol{\phi}_{t} - \left( \boldsymbol{\Phi}_{t-1}^{T} \boldsymbol{\phi}_{t} \right)^{T} \boldsymbol{Q}_{t-1} \boldsymbol{\Phi}_{t-1}^{T} \boldsymbol{\phi}_{t} \\ &= \sigma_{t}^{2} + \boldsymbol{\phi}_{t}^{T} \boldsymbol{\phi}_{t} - \boldsymbol{\phi}_{t}^{T} \left( \mathbf{I} - \hat{\mathbf{P}}_{t-1} \right) \boldsymbol{\phi}_{t} \\ &= \sigma_{t}^{2} + \boldsymbol{\phi}_{t}^{T} \hat{\mathbf{P}}_{t-1} \boldsymbol{\phi}_{t} \\ &= \sigma_{t}^{2} + \boldsymbol{\phi}_{t}^{T} \boldsymbol{p}_{t}. \end{array} $$
(10)
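
Putting Eqs. (7)-(10) together gives a per-sample update that never inverts a growing matrix. The sketch below is one reading of that recursion; the covariance update \(\hat{\mathbf{P}}_{t} = \hat{\mathbf{P}}_{t-1} - \boldsymbol{p}_{t}\boldsymbol{p}_{t}^{T}/s_{t}\) follows from the same block expansion but is not spelled out above, so it is included here as an assumption.

```python
import numpy as np

def recursive_posterior_update(beta_hat, P_hat, phi, x_next, sigma2):
    """One recursive update of the parameter posterior, Eqs. (7)-(10).

    beta_hat : (n,)   previous posterior mean beta_{t-1}
    P_hat    : (n, n) previous posterior covariance P_{t-1}
    phi      : (n,)   new feature vector phi_t
    x_next   : float  new target x_{t+1}
    sigma2   : float  noise variance sigma_t^2
    """
    d = x_next - phi @ beta_hat          # Eq. (8)
    p = P_hat @ phi                      # Eq. (9)
    s = sigma2 + phi @ p                 # Eq. (10)
    beta_new = beta_hat + (p / s) * d    # Eq. (7)
    P_new = P_hat - np.outer(p, p) / s   # assumed covariance recursion
    return beta_new, P_new
```

Starting from \(\hat{\boldsymbol{\beta}}_{0} = \mathbf{0}\) and \(\hat{\mathbf{P}}_{0} = \mathbf{I}\) and applying this update once per sample reproduces the batch posterior given at the start of the appendix.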

Cite this article

Zhong, S., Tan, J., Dong, H. et al. Modeling-Learning-Based Actor-Critic Algorithm with Gaussian Process Approximator. J Grid Computing 18, 181–195 (2020). https://doi.org/10.1007/s10723-020-09512-4
