Policy Evaluation in Continuous MDPs with Efficient Kernelized Gradient Temporal Difference
IEEE Transactions on Automatic Control ( IF 6.8 ) Pub Date : 2020-01-01 , DOI: 10.1109/tac.2020.3029315
Alec Koppel , Garrett Warnell , Ethan Stump , Peter Stone , Alejandro Ribeiro

We consider policy evaluation in infinite-horizon discounted Markov decision problems (MDPs) with infinite state and action spaces. We reformulate this task as a compositional stochastic program with a function-valued decision variable that belongs to a reproducing kernel Hilbert space (RKHS). We approach this problem via a new functional generalization of stochastic quasi-gradient methods operating in tandem with stochastic sparse subspace projections. The result is an extension of gradient temporal difference learning that yields nonlinearly parameterized value function estimates of the solution to the Bellman evaluation equation. Our main contribution is a memory-efficient non-parametric stochastic method guaranteed to converge exactly to the Bellman fixed point with probability $1$ under attenuating step-sizes. Further, with constant step-sizes, we establish mean convergence to a neighborhood of the fixed point and show that the value function estimates have finite complexity. In the Mountain Car domain, we observe faster convergence to lower Bellman error solutions than existing approaches, using only a fraction of the memory they require.
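As a rough illustration of the kind of update the abstract describes, the sketch below combines a GTD2-style stochastic quasi-gradient step with a non-parametric kernel representation of the value function. The class name KernelGTD, the Gaussian kernel, the coherence-threshold rule that caps dictionary growth, and the 1-D random-walk usage example are all assumptions made for this sketch; they stand in for, and simplify, the paper's actual algorithm and its sparse subspace projections.

```python
import numpy as np

class KernelGTD:
    """Minimal sketch of kernelized GTD-style policy evaluation (illustrative only).

    The value estimate is non-parametric, V(s) = sum_i w[i] * k(d[i], s), with a
    Gaussian kernel. A new sample state is appended to the dictionary unless an
    existing atom already represents it well, in which case the update is routed
    to that atom -- a crude stand-in for the paper's sparse subspace projections."""

    def __init__(self, bandwidth=0.5, gamma=0.99, alpha=0.05, beta=0.05, tol=0.95):
        self.bw, self.gamma = bandwidth, gamma
        self.alpha, self.beta = alpha, beta    # primal / auxiliary step-sizes
        self.tol = tol                         # coherence threshold for new atoms
        self.d = []                            # dictionary of stored states
        self.w = []                            # value-function weights
        self.z = []                            # auxiliary (quasi-gradient) weights

    def _kvec(self, s):
        s = np.asarray(s, dtype=float)
        return np.array([np.exp(-np.sum((x - s) ** 2) / (2 * self.bw ** 2))
                         for x in self.d])

    def value(self, s):
        return float(np.dot(self.w, self._kvec(s))) if self.d else 0.0

    def _aux(self, s):
        return float(np.dot(self.z, self._kvec(s))) if self.d else 0.0

    def _locate(self, s):
        """Return the dictionary index used to represent state s, adding it if needed."""
        if self.d:
            k = self._kvec(s)
            j = int(np.argmax(k))
            if k[j] >= self.tol:          # close enough to an existing atom
                return j
        self.d.append(np.asarray(s, dtype=float))
        self.w.append(0.0)
        self.z.append(0.0)
        return len(self.d) - 1

    def update(self, s, r, s_next):
        """One stochastic quasi-gradient step on the observed transition (s, r, s_next)."""
        i, j = self._locate(s), self._locate(s_next)
        delta = r + self.gamma * self.value(s_next) - self.value(s)   # TD error
        z_s = self._aux(s)
        # Primal step along k(s, .) - gamma * k(s_next, .), scaled by the auxiliary estimate.
        self.w[i] += self.alpha * z_s
        self.w[j] -= self.alpha * self.gamma * z_s
        # Auxiliary step tracks the TD error in the RKHS.
        self.z[i] += self.beta * (delta - z_s)
        return delta


# Toy usage: evaluate a 1-D random-walk policy (the paper's benchmark is Mountain Car).
rng = np.random.default_rng(0)
agent = KernelGTD(bandwidth=0.3)
s = np.array([0.0])
for _ in range(5000):
    s_next = np.clip(s + rng.normal(scale=0.1, size=1), -2.0, 2.0)
    r = -abs(float(s_next[0]))            # reward for staying near the origin
    agent.update(s, r, s_next)
    s = s_next
print("dictionary size:", len(agent.d), " V(0) ~", round(agent.value([0.0]), 3))
```

The coherence test above is only one simple way to keep the dictionary, and hence the memory footprint, bounded; the paper's contribution is a projection scheme with convergence guarantees rather than this heuristic.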

Updated: 2020-01-01