Adaptive exploration policy for exploration–exploitation tradeoff in continuous action control optimization

  • Original Article

International Journal of Machine Learning and Cybernetics

Abstract

The optimization of continuous action control is an important research field. It aims to find optimal decisions from the experience of making decisions in a continuous action control task. This can be done via reinforcement learning, which trains an agent to learn a policy by maximizing the cumulative reward of its decisions in a dynamic environment. The exploration–exploitation tradeoff is a key issue in learning this policy. The current solution, called an exploration policy, addresses this issue by adding exploration noise to the policy during training so as to explore more efficiently while maintaining exploitation. This noise is drawn from a fixed distribution throughout training. However, in a dynamic environment the stability of training changes frequently across training episodes, so a fixed exploration policy adapts poorly to the training stability. In this paper, we propose an adaptive exploration policy to address the exploration–exploitation tradeoff. The motivation is that the noise scale should be increased to enhance exploration when the stability of training is high, and reduced to preserve exploitation when the stability of training is low. First, we regard the variance of the cumulative rewards of decisions as an index of the training stability. Then, based on this index, we construct a tradeoff coefficient that is negatively correlated with the training stability. Finally, we propose the adaptive exploration policy, which uses the tradeoff coefficient to adjust the added exploration noise so that it adapts to the training stability. Theoretical analysis and experiments illustrate the effectiveness of our adaptive exploration policy. The source code can be downloaded from https://github.com/grcai/AEP-algorithm.
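To make the mechanism described in the abstract concrete, the following Python sketch illustrates one way such an adaptive exploration policy could be wired up. It is only an illustration under stated assumptions: the class name, the sliding window of episode returns, the 1/(1 + variance) mapping from return variance to a tradeoff coefficient, and the Gaussian noise model are hypothetical choices, not the definitions used in the paper; see the paper and the linked repository for the actual construction.

```python
import numpy as np


class AdaptiveExplorationNoise:
    """Illustrative sketch of variance-adaptive exploration noise.

    NOT the authors' exact algorithm: the window size, the
    1 / (1 + variance) mapping, and the Gaussian noise model are
    hypothetical choices used only to illustrate scaling exploration
    noise with an index of training stability.
    """

    def __init__(self, action_dim, base_sigma=0.1, window=20):
        self.action_dim = action_dim  # dimensionality of the continuous action
        self.base_sigma = base_sigma  # baseline exploration noise scale
        self.window = window          # number of recent episodes kept
        self.returns = []             # cumulative rewards of recent episodes

    def record_episode_return(self, episode_return):
        # Track the cumulative reward of the most recent episodes.
        self.returns.append(float(episode_return))
        if len(self.returns) > self.window:
            self.returns.pop(0)

    def tradeoff_coefficient(self):
        # Low variance of recent returns is taken as a sign of stable
        # training and yields a coefficient near 1 (more exploration);
        # high variance shrinks the coefficient (more exploitation).
        if len(self.returns) < 2:
            return 1.0
        return 1.0 / (1.0 + np.var(self.returns))

    def perturb(self, action):
        # Add Gaussian exploration noise whose scale adapts to stability.
        sigma = self.base_sigma * self.tradeoff_coefficient()
        return action + np.random.normal(0.0, sigma, size=self.action_dim)
```

In use, the agent would call record_episode_return at the end of every training episode and perturb whenever it selects an action during training; at evaluation time the unperturbed policy action would be used.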


Data Availability Statement

All data generated or analyzed during this study are available at https://github.com/grcai/AEP-algorithm.


Acknowledgements

This work is supported in part by the National Natural Science Foundation of China under Grant No. 61772120.

Author information

Corresponding author

Correspondence to William Zhu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Li, M., Huang, T. & Zhu, W. Adaptive exploration policy for exploration–exploitation tradeoff in continuous action control optimization. Int. J. Mach. Learn. & Cyber. 12, 3491–3501 (2021). https://doi.org/10.1007/s13042-021-01387-5
