PP-PG: Combining Parameter Perturbation with Policy Gradient Methods for Effective and Efficient Explorations in Deep Reinforcement Learning
ACM Transactions on Intelligent Systems and Technology (IF 5) Pub Date: 2021-06-03, DOI: 10.1145/3452008
Shilei Li, Meng Li, Jiongming Su, Shaofei Chen, Zhimin Yuan, Qing Ye

Efficient and stable exploration remains a key challenge for deep reinforcement learning (DRL) in high-dimensional action and state spaces. Recently, a promising direction has been to combine exploration in the action space with exploration in the parameter space to get the best of both. In this article, we propose a new iterative, closed-loop framework that combines an evolutionary algorithm (EA), which explores in a gradient-free manner directly in the parameter space, with an actor-critic deep deterministic policy gradient (DDPG) reinforcement learning algorithm, which explores in a gradient-based manner in the action space, so that the two methods cooperate in a more balanced and efficient way. In our framework, the policies represented by the EA population (the parameter perturbation part) evolve in a guided manner by exploiting the gradient information provided by the DDPG, while the policy gradient part (DDPG) is used only as a fine-tuning tool for the best individual in the EA population, improving sample efficiency. In particular, we propose a criterion for determining the number of training steps the DDPG requires, ensuring that useful gradient information can be extracted from the EA-generated samples and that the DDPG and EA parts work together in a more balanced way during each generation. Furthermore, within the DDPG part, our algorithm can flexibly switch between continuing to fine-tune the previous RL-Actor and fine-tuning a new one generated by the EA, according to the situation, to further improve efficiency. Experiments on a range of challenging continuous control benchmarks demonstrate that our algorithm outperforms related methods and offers a satisfactory trade-off between stability and sample efficiency.
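
To make the interaction between the two parts concrete, below is a minimal, self-contained Python sketch of the kind of generational loop the abstract describes. The toy fitness function, the sample-proportional step-budget criterion, and the rule for switching between the previous RL-Actor and the best EA individual are illustrative assumptions for this sketch, not the authors' implementation.

```python
# A minimal sketch of a PP-PG-style training loop as described in the abstract.
# All names and rules (evaluate, ddpg_update, the step-budget criterion, the
# switching rule) are illustrative assumptions, not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)

THETA_DIM = 8          # size of a (toy) actor parameter vector
POP_SIZE = 10          # EA population size
ELITES = 2             # individuals copied unchanged to the next generation
SIGMA = 0.1            # std of the Gaussian parameter perturbation

def evaluate(theta, episodes=1):
    """Stand-in for rolling out the actor in the environment.

    Returns (fitness, samples); here fitness is a dummy quadratic score and
    'samples' just counts the fictitious transitions added to the buffer."""
    fitness = -np.sum((theta - 1.0) ** 2)
    return fitness, 200 * episodes

def ddpg_update(theta, steps):
    """Stand-in for gradient-based fine-tuning of the actor by DDPG.

    Here the policy gradient is faked with the known gradient of the toy
    fitness; in the real algorithm it comes from the learned critic."""
    for _ in range(steps):
        grad = -2.0 * (theta - 1.0)
        theta = theta + 1e-3 * grad
    return theta

population = [rng.normal(0.0, 1.0, THETA_DIM) for _ in range(POP_SIZE)]
rl_actor = population[0].copy()         # the DDPG (policy-gradient) actor

for generation in range(50):
    # 1) EA part: evaluate the perturbed policies, fill the replay buffer.
    results = [evaluate(theta) for theta in population]
    fitness = np.array([r[0] for r in results])
    new_samples = sum(r[1] for r in results)
    order = np.argsort(fitness)[::-1]

    # 2) Criterion for the DDPG training budget (an assumption here:
    #    proportional to the number of samples the EA just generated).
    ddpg_steps = new_samples // 2

    # 3) Flexible switching: keep fine-tuning the previous RL-Actor unless the
    #    best EA individual is better, in which case fine-tune a copy of that
    #    individual instead (again an illustrative rule).
    best = population[order[0]]
    if evaluate(best)[0] > evaluate(rl_actor)[0]:
        rl_actor = best.copy()
    rl_actor = ddpg_update(rl_actor, ddpg_steps)

    # 4) Guided evolution: the fine-tuned RL-Actor re-enters the population,
    #    replacing the worst individual, so gradient information guides the EA.
    population[order[-1]] = rl_actor.copy()

    # 5) Standard EA step: keep elites, refill the rest with Gaussian
    #    perturbations of parents chosen from the top half.
    parents = [population[i] for i in order[: POP_SIZE // 2]]
    next_pop = [population[i].copy() for i in order[:ELITES]]
    while len(next_pop) < POP_SIZE:
        parent = parents[rng.integers(len(parents))]
        next_pop.append(parent + rng.normal(0.0, SIGMA, THETA_DIM))
    population = next_pop

print("best fitness:", max(evaluate(t)[0] for t in population))
```

In the actual algorithm, evaluate would roll out each perturbed actor in the environment and add the transitions to a shared replay buffer, and ddpg_update would perform real critic and actor updates on that buffer; the sketch only mirrors the structure of one generation.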

Updated: 2021-06-03