Abstract
The optimization of continuous action control is an important research field. It aims to find optimal decisions from the experience of making decisions in a continuous action control task. This can be done via reinforcement learning, which trains an agent to learn a policy by maximizing the cumulative reward of the decisions it makes in a dynamic environment. The exploration–exploitation tradeoff is a key issue in learning this policy. The current solution, known as an exploration policy, addresses the issue by adding exploration noise to the policy during training, making exploration more efficient while retaining exploitation. This noise is drawn from a distribution that stays fixed throughout training. However, in a dynamic environment the stability of training changes frequently from one training episode to the next, so an exploration policy with fixed noise adapts poorly to the current training stability. In this paper, we propose an adaptive exploration policy to address the exploration–exploitation tradeoff. The motivation is that the noise scale should be increased to enhance exploration when training is stable, and reduced to retain exploitation when training is unstable. First, we regard the variance of the cumulative rewards of decisions as an index of training stability: the larger the variance, the lower the stability. Then, based on this index, we construct a tradeoff coefficient that is negatively correlated with the variance, and hence positively correlated with the training stability. Finally, we propose the adaptive exploration policy, which uses the tradeoff coefficient to adjust the added exploration noise so that it adapts to the training stability. Theoretical analysis and experiments illustrate the effectiveness of our adaptive exploration policy. The source code can be downloaded from https://github.com/grcai/AEP-algorithm.
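The abstract describes the mechanism only at a high level. Below is a minimal sketch of the idea, assuming Gaussian action noise and a sliding window of recent episode returns as the stability estimate. The class name, the window size, and the mapping 1/(1+var) from return variance to tradeoff coefficient are illustrative assumptions of ours, not the paper's exact formulation; see the authors' repository for the real implementation.

```python
import numpy as np


class AdaptiveExplorationNoise:
    """Sketch of adaptive exploration noise: the Gaussian noise scale is
    multiplied by a tradeoff coefficient that shrinks as the variance of
    recent episode returns (a proxy for training instability) grows."""

    def __init__(self, base_scale=0.1, window=20):
        self.base_scale = base_scale  # noise scale a fixed exploration policy would use
        self.window = window          # number of recent episodes used to estimate stability
        self.returns = []             # cumulative rewards of recent episodes

    def record_episode(self, episode_return):
        # Keep only the most recent `window` episode returns.
        self.returns.append(episode_return)
        self.returns = self.returns[-self.window:]

    def coefficient(self):
        # High variance of returns -> low training stability -> small coefficient.
        # 1/(1+var) is one possible negatively correlated mapping; the paper's
        # exact coefficient may differ (e.g., it may normalize returns first).
        if len(self.returns) < 2:
            return 1.0
        return 1.0 / (1.0 + np.var(self.returns))

    def sample(self, action_dim):
        # Exploration noise to add to the deterministic policy's action.
        return np.random.normal(0.0, self.base_scale * self.coefficient(),
                                size=action_dim)
```

In a DDPG- or TD3-style training loop, `record_episode` would be called once per finished episode, and `sample` would replace the fixed-scale noise normally added to the actor's action.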
Data Availability Statement
All data generated or analyzed during this study are available at https://github.com/grcai/AEP-algorithm.
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China under Grant No. 61772120.
Cite this article
Li, M., Huang, T. & Zhu, W. Adaptive exploration policy for exploration–exploitation tradeoff in continuous action control optimization. Int. J. Mach. Learn. & Cyber. 12, 3491–3501 (2021). https://doi.org/10.1007/s13042-021-01387-5