On the Convergence and Optimality of Policy Gradient for Coherent Risk
arXiv - CS - Machine Learning. Pub Date: 2021-03-04, DOI: arxiv-2103.02827
Audrey Huang, Liu Leqi, Zachary C. Lipton, Kamyar Azizzadenesheli

In order to model risk aversion in reinforcement learning, an emerging line of research adapts familiar algorithms to optimize coherent risk functionals, a class that includes conditional value-at-risk (CVaR). Because optimizing the coherent risk is difficult in Markov decision processes, recent work tends to focus on the Markov coherent risk (MCR), a time-consistent surrogate. While policy gradient (PG) updates have been derived for this objective, it remains unclear (i) whether PG finds a global optimum for MCR, and (ii) how to estimate the gradient in a tractable manner. In this paper, we demonstrate that, in general, MCR objectives (unlike the expected return) are not gradient dominated and that stationary points are not, in general, guaranteed to be globally optimal. Moreover, we present a tight upper bound on the suboptimality of the learned policy, characterizing its dependence on the nonlinearity of the objective and the degree of risk aversion. Addressing (ii), we propose a practical implementation of PG that uses state distribution reweighting to overcome previous limitations. Through experiments, we demonstrate that when the optimality gap is small, PG can learn risk-sensitive policies. However, we find that instances with large suboptimality gaps are abundant and easy to construct, outlining an important challenge for future research.
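For readers skimming the abstract, the following is a minimal sketch of the two risk objects it references, written in the standard notation of the coherent-risk literature (the Rockafellar-Uryasev form of CVaR and the nested Markov coherent risk recursion) rather than in the paper's own notation; the cost function C, transition kernel P, and discount factor gamma below are generic MDP quantities introduced here for illustration, not symbols taken from the paper.

\[
\mathrm{CVaR}_\alpha(Z) \;=\; \inf_{t \in \mathbb{R}} \Big\{\, t + \tfrac{1}{\alpha}\, \mathbb{E}\big[(Z - t)_+\big] \,\Big\}, \qquad \alpha \in (0, 1],
\]

for a random cost Z, and the Markov coherent risk value function satisfies the nested recursion

\[
V_\rho^{\pi}(s) \;=\; C\big(s, \pi(s)\big) + \gamma\, \rho\big( V_\rho^{\pi}(s') \big), \qquad s' \sim P\big(\cdot \mid s, \pi(s)\big),
\]

where the one-step coherent risk measure \(\rho\) (for example \(\mathrm{CVaR}_\alpha\)) is applied to the next-state distribution. Taking \(\rho = \mathbb{E}\) recovers the ordinary expected-cost value function, while risk-averse choices of \(\rho\) make the objective time-consistent but, as the abstract notes, generally break gradient dominance.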

Updated: 2021-03-05