Adaptive exploration policy for exploration–exploitation tradeoff in continuous action control optimization
International Journal of Machine Learning and Cybernetics (IF 5.6), Pub Date: 2021-08-10, DOI: 10.1007/s13042-021-01387-5
Min Li, Tianyi Huang, William Zhu

The optimization of continuous action control is an important research field. It aims to find optimal decisions from the experience of making decisions in a continuous action control task. This can be done via reinforcement learning, which trains an agent to learn a policy by maximizing the cumulative rewards of decisions made in a dynamic environment. The exploration–exploitation tradeoff is a key issue in learning this policy. The current solution, called an exploration policy, addresses this issue by adding exploration noise to the policy during training, enabling more efficient exploration while retaining exploitation. This noise is drawn from a fixed distribution throughout training. However, in a dynamic environment the stability of training changes frequently across training episodes, so an exploration policy with fixed noise adapts poorly to the training stability. In this paper, we propose an adaptive exploration policy to address the exploration–exploitation tradeoff. The motivation is that the noise scale should be increased to enhance exploration when training is stable, and reduced to preserve exploitation when it is not. First, we take the variance of the cumulative rewards obtained from decisions as an index of training stability. Then, based on this index, we construct a tradeoff coefficient that is negatively correlated with the training stability. Finally, we propose the adaptive exploration policy, which uses the tradeoff coefficient to adjust the added exploration noise so that it adapts to the training stability. Theoretical analysis and experiments illustrate the effectiveness of our adaptive exploration policy. The source code can be downloaded from https://github.com/grcai/AEP-algorithm.
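The abstract only outlines the mechanism, so the sketch below illustrates one way the idea could look in code: the variance of recent episode returns serves as an (inverse) index of training stability, and a coefficient derived from it rescales the Gaussian exploration noise added to actions. The class name, window size, sensitivity parameter, and the inverse-variance form of the coefficient are illustrative assumptions, not details taken from the paper; the authors' actual formulation is in the linked repository.

```python
import numpy as np

class AdaptiveExplorationNoise:
    """Minimal sketch of variance-adaptive exploration noise.

    Follows the abstract's motivation: more exploration noise when training
    is stable (low return variance), less when it is unstable (high return
    variance). The exact coefficient form is an assumption.
    """

    def __init__(self, base_scale=0.1, window=20, sensitivity=1.0):
        self.base_scale = base_scale    # noise scale a fixed exploration policy would use
        self.window = window            # number of recent episode returns to track (assumed)
        self.sensitivity = sensitivity  # how strongly variance shrinks the noise (assumed)
        self.returns = []

    def record_return(self, episode_return):
        """Store the cumulative reward of a finished episode."""
        self.returns.append(float(episode_return))
        self.returns = self.returns[-self.window:]

    def noise_scale(self):
        """High variance of recent returns signals low stability, so the
        coefficient (and hence the noise) shrinks as variance grows."""
        if len(self.returns) < 2:
            return self.base_scale
        variance = np.var(self.returns)
        coefficient = 1.0 / (1.0 + self.sensitivity * variance)  # assumed form
        return self.base_scale * coefficient

    def perturb(self, action):
        """Add Gaussian exploration noise with the adaptive scale to an action."""
        return action + np.random.normal(0.0, self.noise_scale(), size=np.shape(action))


# Usage sketch: record each episode's return, then perturb the policy's action.
noise = AdaptiveExplorationNoise(base_scale=0.2)
for episode_return in [10.0, 12.0, 11.5, 30.0]:
    noise.record_return(episode_return)
noisy_action = noise.perturb(np.array([0.5, -0.3]))
```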


