Global Convergence of Policy Gradient Methods to (Almost) Locally Optimal Policies
SIAM Journal on Control and Optimization (IF 2.2), Pub Date: 2020-12-03, DOI: 10.1137/19m1288012
Kaiqing Zhang, Alec Koppel, Hao Zhu, Tamer Başar

SIAM Journal on Control and Optimization, Volume 58, Issue 6, Page 3586-3612, January 2020.
Policy gradient (PG) methods have been one of the most essential ingredients of reinforcement learning, with applications in a variety of domains. Despite this empirical success, a rigorous understanding of the global convergence of PG methods appears to be relatively lacking in the literature, especially for the infinite-horizon discounted setting. In this work, we close this gap by viewing PG methods from a nonconvex optimization perspective. In particular, we propose a new variant of PG methods for infinite-horizon problems that uses a random rollout horizon for the Monte Carlo estimation of the policy gradient. This method yields an unbiased estimate of the policy gradient with bounded variance, which makes it possible to use tools from nonconvex optimization to establish global convergence. Employing this perspective, we first recover, via an alternative route, the convergence to stationary-point policies established in the literature. Motivated by recent advances in nonconvex optimization, we then modify the proposed PG method by introducing a periodically enlarged stepsize rule. More interestingly, this modified algorithm is shown to escape saddle points under mild assumptions on the reward function and the policy parameterization of the reinforcement learning (RL) problem. Specifically, we connect the correlated negative curvature condition of [H. Daneshmand et al., Escaping saddles with stochastic gradients, in Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 2018, pp. 1155--1164] to the requirement that the reward be strictly positive or strictly negative. Under the additional assumption that all saddle points are strict, this result essentially establishes convergence to actual locally optimal policies of the underlying problem, and thus provides rigorous support for an argument that has often been overclaimed in the literature on the convergence of PG methods. In this respect, our findings justify the benefit of reward reshaping for escaping saddle points from a nonconvex optimization perspective.
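To make the two algorithmic ideas in the abstract concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): a toy tabular MDP with a softmax policy, a REINFORCE-style gradient estimate truncated at a random geometric rollout horizon, and a stochastic ascent loop whose stepsize is periodically enlarged. The problem sizes, the uniform initial-state distribution, the geometric horizon distribution, and the stepsize constants are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch under the assumptions stated above; sizes and constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# Random toy MDP: P[s, a] is a distribution over next states,
# R[s, a] is a strictly positive bounded reward (cf. the abstract's condition).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.1, 1.0, size=(n_states, n_actions))

def softmax_policy(theta, s):
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def rollout_gradient(theta):
    """One stochastic policy-gradient estimate with a random rollout horizon.

    T is drawn from Geometric(1 - gamma), so P(T > t) = gamma**t; summing the
    score-weighted rewards over the first T steps (without explicit discount
    factors) gives an unbiased estimate of the gradient of the discounted
    infinite-horizon return.
    """
    T = rng.geometric(1.0 - gamma)            # random rollout horizon, T >= 1
    s = rng.integers(n_states)                # uniform initial state (an assumption)
    grad = np.zeros_like(theta)
    score_sum = np.zeros_like(theta)          # running sum of score functions
    for _ in range(T):
        p = softmax_policy(theta, s)
        a = rng.choice(n_actions, p=p)
        score_sum[s] += np.eye(n_actions)[a] - p   # grad of log pi(a|s) for softmax
        grad += R[s, a] * score_sum
        s = rng.choice(n_states, p=P[s, a])
    return grad

theta = np.zeros((n_states, n_actions))
base_step, large_step, period = 0.01, 0.1, 50      # illustrative stepsize schedule
for k in range(2000):
    # Periodically enlarged stepsize, in the spirit of the modified algorithm above.
    eta = large_step if k % period == 0 else base_step
    theta += eta * rollout_gradient(theta)
```

Because P(T > t) = gamma^t for the geometric horizon, truncating the rollout at T and dropping the explicit discount factors leaves the estimator unbiased for the discounted objective, while the light geometric tail keeps its variance bounded when rewards are bounded.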


Updated: 2020-12-04