Adaptive exploration policy for exploration–exploitation tradeoff in continuous action control optimization

  • Original Article

International Journal of Machine Learning and Cybernetics

Abstract

The optimization of continuous action control is an important research field. It aims to find optimal decisions from the experience of making decisions in a continuous action control task. This can be done via reinforcement learning, which trains an agent to learn a policy by maximizing the cumulative reward of its decisions in a dynamic environment. The exploration–exploitation tradeoff is a key issue in learning this policy. The current solution, called an exploration policy, addresses this issue by adding exploration noise to the policy during training so as to explore more efficiently while maintaining exploitation. This noise is drawn from a fixed distribution throughout training. However, in a dynamic environment the stability of training changes frequently across training episodes, so a fixed exploration policy adapts poorly to the training stability. In this paper, we propose an adaptive exploration policy to address the exploration–exploitation tradeoff. The motivation is that the noise scale should be increased to enhance exploration when the stability of training is high, and reduced to preserve exploitation when the stability of training is low. First, we regard the variance of the cumulative rewards of decisions as an index of the training stability. Then, based on this index, we construct a tradeoff coefficient that is negatively correlated with the training stability. Finally, we propose the adaptive exploration policy, which uses the tradeoff coefficient to adjust the added exploration noise so that it adapts to the training stability. Theoretical analysis and experiments illustrate the effectiveness of our adaptive exploration policy. The source code can be downloaded from https://github.com/grcai/AEP-algorithm.
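To make the mechanism described in the abstract concrete, the following Python sketch illustrates one way such an adaptive exploration policy could be wired up. It is only an illustration under stated assumptions: the class name, the sliding window of episode returns, the 1/(1 + variance) mapping from return variance to a tradeoff coefficient, and the Gaussian noise model are hypothetical choices, not the definitions used in the paper; see the paper and the linked repository for the actual construction.

```python
import numpy as np


class AdaptiveExplorationNoise:
    """Illustrative sketch of variance-adaptive exploration noise.

    NOT the authors' exact algorithm: the window size, the
    1 / (1 + variance) mapping, and the Gaussian noise model are
    hypothetical choices used only to illustrate scaling exploration
    noise with an index of training stability.
    """

    def __init__(self, action_dim, base_sigma=0.1, window=20):
        self.action_dim = action_dim  # dimensionality of the continuous action
        self.base_sigma = base_sigma  # baseline exploration noise scale
        self.window = window          # number of recent episodes kept
        self.returns = []             # cumulative rewards of recent episodes

    def record_episode_return(self, episode_return):
        # Track the cumulative reward of the most recent episodes.
        self.returns.append(float(episode_return))
        if len(self.returns) > self.window:
            self.returns.pop(0)

    def tradeoff_coefficient(self):
        # Low variance of recent returns is taken as a sign of stable
        # training and yields a coefficient near 1 (more exploration);
        # high variance shrinks the coefficient (more exploitation).
        if len(self.returns) < 2:
            return 1.0
        return 1.0 / (1.0 + np.var(self.returns))

    def perturb(self, action):
        # Add Gaussian exploration noise whose scale adapts to stability.
        sigma = self.base_sigma * self.tradeoff_coefficient()
        return action + np.random.normal(0.0, sigma, size=self.action_dim)
```

In use, the agent would call record_episode_return at the end of every training episode and perturb whenever it selects an action during training; at evaluation time the unperturbed policy action would be used.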


Data Availability Statement

All data generated or analyzed during this study are available at https://github.com/grcai/AEP-algorithm.


Acknowledgements

This work is supported in part by the National Natural Science Foundation of China under Grant No. 61772120.

Author information

Corresponding author

Correspondence to William Zhu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Li, M., Huang, T. & Zhu, W. Adaptive exploration policy for exploration–exploitation tradeoff in continuous action control optimization. Int. J. Mach. Learn. & Cyber. 12, 3491–3501 (2021). https://doi.org/10.1007/s13042-021-01387-5
