Trust region policy optimization via entropy regularization for Kullback–Leibler divergence constraint
Neurocomputing (IF 6) Pub Date: 2024-04-16, DOI: 10.1016/j.neucom.2024.127716
Haotian Xu, Junyu Xuan, Guangquan Zhang, Jie Lu

Trust region policy optimization (TRPO) is one of the landmark policy optimization algorithms in deep reinforcement learning. Its purpose is to maximize a surrogate objective based on an advantage function, subject to a bound on the Kullback–Leibler (KL) divergence between two consecutive policies. Although there have been many successful applications of this algorithm in the literature, the approach has often been criticized for suppressing exploration in some application environments due to its strict divergence constraint. As such, most researchers prefer to use entropy regularization, which is added to the expected discounted reward or the surrogate objective. That said, there is much debate about whether there might be an alternative strategy for regularizing TRPO. In this paper, we present just such a strategy: regularizing the KL divergence-based constraint via Shannon entropy. This approach enlarges the difference between two consecutive policies and thus yields a new TRPO scheme with entropy regularization on the KL divergence constraint. Next, the surrogate objective and the Shannon entropy are approximated linearly, while the KL divergence is expanded quadratically. An efficient conjugate gradient optimization procedure then solves two sets of linear equations, providing a detailed code-level implementation that can be used for a fair experimental comparison. Extensive experiments in eight benchmark environments demonstrate that our proposed method is superior to both the original TRPO and the TRPO with an entropy-regularized objective. Further, theoretical and experimental analysis shows that the three TRPO-like methods have equal time complexity and comparable computational burden.
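As a rough sketch of the approximation step summarized in the abstract (the notation here is ours and may differ from the paper's): let θ denote the policy parameters, θ_old the parameters of the previous policy, g the gradient of the surrogate objective L, b the gradient of the Shannon entropy, and F the Fisher information matrix arising from the quadratic expansion of the average KL divergence. The local approximations would then take a form like

\[ L(\theta) \approx g^{\top}(\theta - \theta_{\mathrm{old}}), \qquad \mathcal{H}(\pi_{\theta}) \approx \mathcal{H}(\pi_{\theta_{\mathrm{old}}}) + b^{\top}(\theta - \theta_{\mathrm{old}}), \qquad \overline{D}_{\mathrm{KL}}(\theta_{\mathrm{old}}, \theta) \approx \tfrac{1}{2}\,(\theta - \theta_{\mathrm{old}})^{\top} F\,(\theta - \theta_{\mathrm{old}}), \]

so the "two sets of linear equations" solved by conjugate gradient would be of the form F x = g and F y = b, with the two solutions combined into the policy update direction.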
