Deep Deterministic Policy Gradient With Compatible Critic Network
IEEE Transactions on Neural Networks and Learning Systems (IF 10.4), Pub Date: 2021-10-15, DOI: 10.1109/tnnls.2021.3117790
Di Wang, Mengqi Hu

Deep deterministic policy gradient (DDPG) is a powerful reinforcement learning algorithm for large-scale continuous control. DDPG back-propagates gradients from the state-action value function directly to the actor network’s parameters, which poses a significant challenge for the compatibility of the critic network. Compatibility here means that the policy evaluation step is consistent with the policy improvement step. As proved for the deterministic policy gradient, a compatible function approximator guarantees convergence but tightly restricts the form of the critic network. The complexity and limitations of the compatible function have impeded its adoption in DDPG. This article introduces gradient-based neural-network similarity indices to measure compatibility concretely. We represent the actor network’s and the critic network’s training data, trained parameters, and gradients as kernel matrices. A sketching trick substantially reduces the time needed to compute the similarity indices. Empirically, the centered kernel alignment index and the normalized Bures similarity index yield consistent compatibility scores. Moreover, we demonstrate the necessity of a compatible critic network in DDPG from three aspects: 1) analyzing the policy improvement/evaluation steps; 2) conducting a theoretical analysis; and 3) presenting experimental results. Building on this analysis, we remodel the compatible function with an energy-function model, making it suitable for large state-action space problems. By introducing policy-change information into the critic-network optimization process, the critic network achieves higher compatibility scores and better performance. In addition, based on our experimental observations, we propose a computationally light solution to overestimation. To demonstrate our algorithm’s performance and validate the compatibility of the critic network, we compare it with six state-of-the-art algorithms on seven PyBullet robotics environments.
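For context, the compatibility the abstract refers to comes from the original deterministic policy gradient result. Below is a minimal LaTeX sketch of the deterministic policy gradient and the compatible-critic condition, following the standard formulation (Silver et al., 2014); the notation is ours, not taken from this paper.

```latex
% Deterministic policy gradient: the actor parameters \theta are updated by
% back-propagating the critic's action gradient through the policy \mu_\theta.
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[
      \nabla_\theta \mu_\theta(s)\,
      \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}
    \right]

% A critic Q^{w} is "compatible" with the actor when its action gradient
% takes the form below at a = \mu_\theta(s), so that substituting Q^{w}
% for Q^{\mu} leaves the policy gradient unbiased:
\nabla_a Q^{w}(s,a)\big|_{a=\mu_\theta(s)}
  = \nabla_\theta \mu_\theta(s)^{\top} w
```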
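The similarity indices mentioned in the abstract can also be illustrated with a small sketch. The Python example below computes linear centered kernel alignment (CKA) between two feature/gradient matrices after a Gaussian random-projection sketch to cut computation cost. The function names, the projection choice, and the toy data are illustrative assumptions, not the paper’s implementation, and the normalized Bures similarity it also uses is not shown here.

```python
import numpy as np

def sketch(X, k, rng):
    """Gaussian random projection of the feature dimension down to k.
    A generic sketching stand-in; the paper's exact sketching scheme may differ."""
    S = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ S

def linear_cka(X, Y):
    """Linear centered kernel alignment between matrices X (n x d1) and
    Y (n x d2) whose rows correspond to the same n examples."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature column
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, 'fro') ** 2
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / den

# Hypothetical usage: compare actor-side and critic-side gradient features
# on the same minibatch after sketching both down to a small common width.
rng = np.random.default_rng(0)
actor_feats = rng.standard_normal((256, 4096))    # stand-in gradient features
critic_feats = rng.standard_normal((256, 8192))
score = linear_cka(sketch(actor_feats, 128, rng), sketch(critic_feats, 128, rng))
print(f"compatibility score (linear CKA): {score:.3f}")
```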

Updated: 2021-10-15