Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling
Machine Learning (IF 7.5), Pub Date: 2021-01-04, DOI: 10.1007/s10994-020-05912-5
L. A. Prashanth, Nathaniel Korda, Rémi Munos

We propose a stochastic approximation (SA) based method with randomization of samples for policy evaluation using the least squares temporal difference (LSTD) algorithm. Our proposed scheme is equivalent to running regular temporal difference learning with linear function approximation, albeit with samples drawn uniformly at random from a given dataset. Our method results in an $O(d)$ improvement in complexity in comparison to LSTD, where $d$ is the dimension of the data. We provide non-asymptotic bounds for our proposed method, both in high probability and in expectation, under the assumption that the matrix underlying the LSTD solution is positive definite. The latter assumption is easily satisfied for the pathwise LSTD variant proposed in [23]. Moreover, we establish that using our method in place of LSTD does not affect the rate of convergence of the approximate value function to the true value function. These rate results, coupled with the low computational complexity of our method, make it attractive for implementation in big data settings, where $d$ is large. A well-known low-complexity alternative of this kind for least squares regression is the stochastic gradient descent (SGD) algorithm, and we provide finite-time bounds for SGD. We empirically demonstrate the practicality of our method as an efficient alternative to pathwise LSTD by combining it with the least squares policy iteration (LSPI) algorithm in a traffic signal control application. We also conduct another set of experiments that combines the SA-based low-complexity variant for least squares regression with the LinUCB algorithm for contextual bandits, using the large-scale news recommendation dataset from Yahoo.
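To make the per-iteration cost concrete, the following is a minimal NumPy sketch of the scheme the abstract describes: ordinary TD(0) with linear function approximation, where each update is driven by a transition sampled uniformly at random from a fixed batch. The function name, the step-size schedule, and the discount value are illustrative assumptions and are not taken from the paper; in particular, the specific step sizes and averaging choices required by the concentration bounds are not reproduced here.

```python
import numpy as np

def batch_td_uniform(features, rewards, next_features, num_iters,
                     discount=0.95, step_size=lambda t: 1.0 / (t + 1)):
    """Sketch: TD(0) with linear function approximation on batch data.

    Each iteration draws one transition uniformly at random from a fixed
    batch of T transitions (feature, reward, next-feature triples).

    features:      (T, d) array of phi(s_i)
    rewards:       (T,)   array of r_i
    next_features: (T, d) array of phi(s'_i)
    """
    T, d = features.shape
    theta = np.zeros(d)                       # value-function parameter
    for t in range(num_iters):
        i = np.random.randint(T)              # uniform sample from the batch
        phi, phi_next = features[i], next_features[i]
        # TD error for the sampled transition
        delta = rewards[i] + discount * (phi_next @ theta) - phi @ theta
        # O(d) update per iteration, versus the O(d^2) cost of an LSTD step
        theta = theta + step_size(t) * delta * phi
    return theta
```

Each iteration touches a single $d$-dimensional feature vector, which is the source of the $O(d)$ per-step cost highlighted above; LSTD, by contrast, maintains and solves a $d \times d$ linear system.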

Updated: 2021-01-04