On The Convergence Of Policy Iteration-Based Reinforcement Learning With Monte Carlo Policy Evaluation
arXiv - EE - Systems and Control. Pub Date: 2023-01-23, DOI: arxiv-2301.09709
Anna Winnicki, R. Srikant

A common technique in reinforcement learning is to evaluate the value function from Monte Carlo simulations of a given policy, and use the estimated value function to obtain a new policy which is greedy with respect to the estimated value function. A well-known longstanding open problem in this context is to prove the convergence of such a scheme when the value function of a policy is estimated from data collected from a single sample path obtained from implementing the policy (see page 99 of [Sutton and Barto, 2018], page 8 of [Tsitsiklis, 2002]). We present a solution to the open problem by showing that a first-visit version of such a policy iteration scheme indeed converges to the optimal policy provided that the policy improvement step uses lookahead [Silver et al., 2016, Mnih et al., 2016, Silver et al., 2017b] rather than a simple greedy policy improvement. We provide results both for the original open problem in the tabular setting and also present extensions to the function approximation setting, where we show that the policy resulting from the algorithm performs close to the optimal policy within a function approximation error.
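As a rough illustration of the scheme described in the abstract, the following sketch implements tabular first-visit Monte Carlo policy evaluation followed by H-step lookahead policy improvement. It is not the authors' algorithm: the function names (`first_visit_mc_evaluation`, `lookahead_improvement`, `mc_policy_iteration`), all parameters, and the use of several short rollouts instead of the single sample path analyzed in the paper are illustrative assumptions, and the lookahead step assumes access to the transition model (P, R).

```python
import numpy as np


def first_visit_mc_evaluation(env_step, policy, n_states, n_episodes,
                              horizon, gamma, rng):
    """Estimate V^pi with first-visit Monte Carlo from sampled rollouts."""
    returns_sum = np.zeros(n_states)
    returns_cnt = np.zeros(n_states)
    for _ in range(n_episodes):
        # Roll out one (truncated) episode under the fixed policy.
        s = rng.integers(n_states)
        traj = []
        for _ in range(horizon):
            a = policy[s]
            s_next, r = env_step(s, a)
            traj.append((s, r))
            s = s_next
        # Accumulate discounted returns backwards along the trajectory.
        G, returns = 0.0, []
        for s_t, r_t in reversed(traj):
            G = r_t + gamma * G
            returns.append((s_t, G))
        returns.reverse()
        # First-visit rule: credit only the first occurrence of each state.
        seen = set()
        for s_t, G_t in returns:
            if s_t not in seen:
                seen.add(s_t)
                returns_sum[s_t] += G_t
                returns_cnt[s_t] += 1
    # States never visited keep a value estimate of zero.
    return returns_sum / np.maximum(returns_cnt, 1)


def lookahead_improvement(P, R, V_hat, gamma, depth):
    """H-step lookahead improvement: apply depth-1 Bellman optimality
    backups to the Monte Carlo estimate, then act greedily."""
    V = V_hat.copy()
    for _ in range(depth - 1):
        V = (R + gamma * np.einsum('san,n->sa', P, V)).max(axis=1)
    Q = R + gamma * np.einsum('san,n->sa', P, V)
    return Q.argmax(axis=1)


def mc_policy_iteration(P, R, gamma=0.95, n_iters=25, n_episodes=200,
                        horizon=100, depth=3, seed=0):
    """Alternate Monte Carlo evaluation and lookahead-based improvement."""
    rng = np.random.default_rng(seed)
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)

    def env_step(s, a):
        # Sample one transition and reward from the simulator.
        s_next = rng.choice(n_states, p=P[s, a])
        return s_next, R[s, a]

    for _ in range(n_iters):
        V_hat = first_visit_mc_evaluation(env_step, policy, n_states,
                                          n_episodes, horizon, gamma, rng)
        policy = lookahead_improvement(P, R, V_hat, gamma, depth)
    return policy, V_hat
```

With depth=1 the improvement step reduces to the simple greedy update for which convergence is the open question; depth>1 corresponds to the lookahead-based improvement for which the paper establishes convergence.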

Updated: 2023-01-25