Smoothed functional-based gradient algorithms for off-policy reinforcement learning: A non-asymptotic viewpoint
Systems & Control Letters (IF 2.6), Pub Date: 2021-07-19, DOI: 10.1016/j.sysconle.2021.104988
Nithia Vijayan, Prashanth L.A.

We propose two policy gradient algorithms for solving the problem of control in an off-policy reinforcement learning (RL) context. Both algorithms incorporate a smoothed functional (SF) based gradient estimation scheme. The first algorithm is a straightforward combination of importance sampling-based off-policy evaluation with SF-based gradient estimation. The second algorithm, inspired by the stochastic variance-reduced gradient (SVRG) algorithm, incorporates variance reduction in the update iteration. For both algorithms, we derive non-asymptotic bounds that establish convergence to an approximate stationary point. From these results, we infer that the first algorithm converges at a rate that is comparable to the well-known REINFORCE algorithm in an off-policy RL context, while the second algorithm exhibits an improved rate of convergence.
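To make the two schemes concrete, the following is a minimal Python sketch written under assumptions that the abstract does not specify: trajectories are generated by a fixed behaviour policy mu, the return of the target policy pi_theta is estimated with trajectory-level importance sampling, and the smoothed functional estimate uses Gaussian perturbations with a two-sided finite difference. All names (`is_return`, `sf_gradient`, `sample_batch`, and so on) are hypothetical, and the paper's exact estimators, batch schedules, and step sizes may differ.

```python
import numpy as np

rng = np.random.default_rng(0)


def is_return(theta, trajectory, log_pi, log_mu, gamma=0.99):
    """Trajectory-level importance-sampling estimate of the target-policy return:
    (prod_t pi_theta(a_t|s_t) / mu(a_t|s_t)) * sum_t gamma^t r_t,
    where the trajectory (list of (s, a, r) tuples) was generated by mu."""
    log_ratio = sum(log_pi(theta, s, a) - log_mu(s, a) for s, a, _ in trajectory)
    ret = sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))
    return np.exp(log_ratio) * ret


def sf_gradient(theta, trajectories, log_pi, log_mu, delta=0.1, u=None):
    """Smoothed-functional gradient estimate: perturb theta along a Gaussian
    direction u and take a finite difference of the IS-estimated returns."""
    if u is None:
        u = rng.standard_normal(theta.shape)
    j_plus = np.mean([is_return(theta + delta * u, tr, log_pi, log_mu) for tr in trajectories])
    j_minus = np.mean([is_return(theta - delta * u, tr, log_pi, log_mu) for tr in trajectories])
    return (j_plus - j_minus) / (2.0 * delta) * u


def sf_ascent(theta0, sample_batch, log_pi, log_mu, iters=200, lr=0.01):
    """Sketch of the first scheme: plain SF gradient ascent on the IS objective."""
    theta = theta0.copy()
    for _ in range(iters):
        theta += lr * sf_gradient(theta, sample_batch(), log_pi, log_mu)
    return theta


def svrg_sf_ascent(theta0, sample_batch, sample_full_batch, log_pi, log_mu,
                   epochs=20, inner=10, lr=0.01, delta=0.1):
    """Sketch of the second, SVRG-inspired scheme: anchor a large-batch SF
    gradient at a snapshot and correct each inner update with it; reusing the
    same batch and perturbation for g and g_ref is what lowers the variance."""
    theta = theta0.copy()
    for _ in range(epochs):
        snapshot = theta.copy()
        g_snapshot = sf_gradient(snapshot, sample_full_batch(), log_pi, log_mu, delta)
        for _ in range(inner):
            batch = sample_batch()
            u = rng.standard_normal(theta.shape)  # shared perturbation direction
            g = sf_gradient(theta, batch, log_pi, log_mu, delta, u)
            g_ref = sf_gradient(snapshot, batch, log_pi, log_mu, delta, u)
            theta += lr * (g - g_ref + g_snapshot)
    return theta
```

In this reading, `sf_ascent` corresponds to the straightforward combination of importance sampling with SF-based gradient estimation, while `svrg_sf_ascent` illustrates how a snapshot gradient can be used to reduce the variance of each update, which is the mechanism behind the improved convergence rate claimed for the second algorithm.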



Updated: 2021-07-19