Eligibility traces and forgetting factor in recursive least-squares-based temporal difference
International Journal of Adaptive Control and Signal Processing (IF 3.1), Pub Date: 2021-05-31, DOI: 10.1002/acs.3282
Simone Baldi, Zichen Zhang, Di Liu

We propose a new reinforcement learning method in the framework of Recursive Least Squares-Temporal Difference (RLS-TD). Instead of using the standard mechanism of eligibility traces (resulting in RLS-TD(λ)), we propose to use the forgetting factor commonly employed in gradient-based or least-squares estimation, and we show that it plays a role similar to eligibility traces. An instrumental variable perspective is adopted to formulate the new algorithm, referred to as RLS-TD with forgetting factor (RLS-TD-f). An interesting aspect of the proposed algorithm is that it admits an interpretation as the minimizer of an appropriate cost function. We test the effectiveness of the algorithm in a Policy Iteration setting, meaning that we aim to improve the performance of an initially stabilizing control policy (over a large portion of the state space). We take a cart-pole benchmark and an adaptive cruise control benchmark as experimental platforms.
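To make the idea concrete, the following is a minimal sketch of an RLS-TD update with a forgetting factor, under stated assumptions: the value function is linear in features, the instrument is the current feature vector phi(s_t), the regressor is phi(s_t) - gamma*phi(s_{t+1}), and a scalar beta < 1 discounts old samples in place of eligibility traces. All variable names and the toy two-state example are illustrative, not taken from the paper.

```python
import numpy as np

# Sketch of an RLS-TD update with forgetting factor (hypothetical names;
# the paper's exact formulation may differ). Value function is linear:
# V(s) ~= theta @ phi(s). The instrument is phi(s_t), the regressor is
# phi(s_t) - gamma * phi(s_{t+1}), and beta < 1 exponentially forgets
# old samples instead of using eligibility traces.

def rls_td_f_step(theta, P, phi, phi_next, r, gamma=0.9, beta=0.99):
    """One recursive update; returns the new (theta, P)."""
    x = phi - gamma * phi_next           # temporal-difference regressor
    k = P @ phi / (beta + x @ P @ phi)   # gain built from the instrument phi
    theta = theta + k * (r - x @ theta)  # correction by the TD error
    P = (P - np.outer(k, x @ P)) / beta  # forgetting-factor covariance update
    return theta, P

# Toy problem: deterministic two-state cycle, 0 -> 1 (reward 1),
# 1 -> 0 (reward 0), with tabular one-hot features.
phi = np.eye(2)
theta, P = np.zeros(2), 1e3 * np.eye(2)
s = 0
for _ in range(500):
    s_next = 1 - s
    r = 1.0 if s == 0 else 0.0
    theta, P = rls_td_f_step(theta, P, phi[s], phi[s_next], r)
    s = s_next

print(theta)  # approaches the true values [1/0.19, 0.9/0.19] ~= [5.26, 4.74]
```

For this consistent toy data the recursion converges to the exact TD fixed point regardless of beta; the forgetting factor matters when the policy (and hence the target values) drifts over time, as in a Policy Iteration loop.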
