Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning
Automatica (IF 6.4), Pub Date: 2022-09-28, DOI: 10.1016/j.automatica.2022.110623
Zaiwei Chen, Sheng Zhang, Thinh T. Doan, John-Paul Clarke, Siva Theja Maguluri

Motivated by applications in reinforcement learning (RL), we study a nonlinear stochastic approximation (SA) algorithm under Markovian noise, and establish its finite-sample convergence bounds under various stepsizes. Specifically, we show that when using a constant stepsize (i.e., αk ≡ α), the algorithm achieves exponentially fast convergence to a neighborhood (with radius O(α log(1/α))) around the desired limit point. When using diminishing stepsizes with an appropriate decay rate, the algorithm converges at rate O(log(k)/k). Our proof is based on Lyapunov drift arguments, and to handle the Markovian noise, we exploit the fast mixing of the underlying Markov chain. To demonstrate the generality of our theoretical results on Markovian SA, we use them to derive finite-sample bounds for the popular Q-learning algorithm with linear function approximation, under a condition on the behavior policy. Importantly, we do not need to assume that the samples are i.i.d., and we do not require an artificial projection step in the algorithm. Numerical simulations corroborate our theoretical results.
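
As a rough illustration of the setting described above, the sketch below (not the paper's code or experiments) runs Q-learning with linear function approximation on a small synthetic MDP, written as the nonlinear SA update the abstract refers to: a single Markovian trajectory generated by a fixed behavior policy, a constant stepsize, and no projection step. The MDP, feature map phi, uniformly random behavior policy, and all constants are placeholder assumptions chosen only to make the update concrete.

```python
import numpy as np

# Illustrative sketch: Q-learning with linear function approximation as a
# nonlinear stochastic approximation update driven by one Markovian sample
# path. All quantities below are synthetic placeholders.
rng = np.random.default_rng(0)
n_states, n_actions, d = 5, 2, 3        # small synthetic MDP, d-dimensional features
gamma, alpha = 0.9, 0.05                # discount factor, constant stepsize (alpha_k = alpha for all k)
phi = rng.normal(size=(n_states, n_actions, d))                    # feature vectors phi(s, a)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # transition kernel P(s' | s, a)
R = rng.uniform(size=(n_states, n_actions))                        # reward function r(s, a)

theta = np.zeros(d)                      # weight vector; Q(s, a) is approximated by phi(s, a) . theta
s = 0
for k in range(10_000):
    a = rng.integers(n_actions)          # uniformly random behavior policy (assumption)
    s_next = rng.choice(n_states, p=P[s, a])
    # TD error using the greedy (max) action value at the next state
    q_next = phi[s_next] @ theta
    delta = R[s, a] + gamma * q_next.max() - phi[s, a] @ theta
    # SA update with constant stepsize and no projection step
    theta += alpha * delta * phi[s, a]
    s = s_next                           # Markovian noise: the next sample depends on the current state

print("learned weights:", theta)
```
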



Updated: 2022-09-28