A Maximum Divergence Approach to Optimal Policy in Deep Reinforcement Learning
IEEE Transactions on Cybernetics (IF 9.4), Pub Date: 2021-09-03, DOI: 10.1109/tcyb.2021.3104612
Zhiyou Yang, Hong Qu, Mingsheng Fu, Wang Hu, Yongze Zhao

Model-free reinforcement learning algorithms based on entropy regularization have achieved good performance in control tasks. These algorithms add an entropy-regularization term to the policy objective in order to learn a stochastic policy. This work provides a new perspective that aims to explicitly learn a representation of the intrinsic information in state transitions to obtain a multimodal stochastic policy, addressing the tradeoff between exploration and exploitation. We study a class of Markov decision processes (MDPs) with divergence maximization, called divergence MDPs. The goal of a divergence MDP is to find an optimal stochastic policy that maximizes the sum of the expected discounted total reward and a divergence term, where the divergence function learns the implicit information of the state transitions. Thus, it can provide better stochastic policies that improve both robustness and performance in high-dimensional continuous settings. Under this framework, the optimality equations can be obtained, and a divergence actor–critic algorithm is then developed, based on the divergence policy iteration method, to address large-scale continuous problems. The experimental results show that, compared to other methods, our approach achieves better performance and robustness, particularly in complex environments. The code of DivAC can be found at https://github.com/yzyvl/DivAC.
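To make the objective concrete, the divergence MDP can be read as a divergence-regularized analogue of the familiar entropy-regularized objective. The following is only a generic sketch inferred from the abstract, with an assumed tradeoff coefficient \alpha and an unspecified divergence term D(\cdot \mid s_t, a_t) that captures information about the state transition; the exact form of D is defined in the full paper, not in this abstract:

J(\pi) = \mathbb{E}_{\pi}\Big[ \sum_{t=0}^{\infty} \gamma^{t} \big( r(s_t, a_t) + \alpha\, D(\cdot \mid s_t, a_t) \big) \Big]

Replacing D with the policy entropy \mathcal{H}(\pi(\cdot \mid s_t)) recovers the standard entropy-regularized objective used by maximum-entropy methods, whereas the divergence term here instead encodes what the divergence function learns from the state-transition dynamics.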

Updated: 2021-09-03