Structured Policy Iteration for Linear Quadratic Regulator
arXiv - CS - Artificial Intelligence. Pub Date: 2020-07-13, DOI: arxiv-2007.06202
Youngsuk Park, Ryan A. Rossi, Zheng Wen, Gang Wu, Handong Zhao

The linear quadratic regulator (LQR) is one of the most popular frameworks for tackling continuous Markov decision process tasks. With its well-developed theory and tractable optimal policy, LQR has been revisited and analyzed in recent years through the lens of reinforcement learning, in both the model-free and model-based settings. In this paper, we introduce \textit{Structured Policy Iteration} (S-PI) for LQR, a method capable of deriving a structured linear policy. Such a structured policy, with (block) sparsity or low rank, can have significant advantages over the standard LQR policy: it is more interpretable, more memory-efficient, and better suited to the distributed setting. To derive such a policy, we first formulate a regularized LQR problem for the case where the model is known. Our S-PI algorithm then solves this regularized LQR efficiently by alternating between a policy evaluation step and a policy improvement step. We further extend S-PI to the model-free setting, where a smoothing procedure is adopted to estimate the gradient. In both the known-model and model-free settings, we prove convergence under a proper choice of parameters. Finally, experiments demonstrate the ability of S-PI to balance LQR performance against the level of structure by varying the weight parameter.
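The abstract describes S-PI only at a high level. A minimal sketch of the two steps in the known-model setting, assuming an elementwise l1 (lasso) regularizer whose proximal operator is soft-thresholding, a fixed step size, and an identity initial-state covariance (none of which are specified in the abstract), might look as follows; the zeroth-order estimator at the end is one common Gaussian-smoothing variant of a model-free gradient estimate, not necessarily the paper's exact procedure.

```python
# A minimal sketch of the known-model S-PI loop: policy evaluation via
# discrete Lyapunov solves, then a proximal-gradient policy improvement
# step. The lasso regularizer, fixed step size, and identity initial-state
# covariance are illustrative assumptions, not the paper's exact settings.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def spi_known_model(A, B, Q, R, K, lam=0.1, eta=1e-3, iters=500):
    """Dynamics x' = Ax + Bu, cost x'Qx + u'Ru, policy u = -Kx.
    K must keep A - B @ K stable (spectral radius < 1) throughout."""
    Sigma0 = np.eye(A.shape[0])  # assumed initial-state covariance
    for _ in range(iters):
        Acl = A - B @ K
        # Policy evaluation: cost-to-go P and state correlation Sigma,
        # each the solution of a discrete Lyapunov equation.
        P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
        Sigma = solve_discrete_lyapunov(Acl, Sigma0)
        # Policy gradient of the unregularized LQR cost J(K).
        grad = 2.0 * ((R + B.T @ P @ B) @ K - B.T @ P @ A) @ Sigma
        # Policy improvement: proximal gradient step. The prox of the
        # l1 penalty is elementwise soft-thresholding, which zeroes out
        # small entries of K and so induces a sparse (structured) policy.
        G = K - eta * grad
        K = np.sign(G) * np.maximum(np.abs(G) - eta * lam, 0.0)
    return K

def smoothed_gradient(J, K, r=0.05, n_samples=200):
    """Model-free stand-in for grad: a Gaussian-smoothing zeroth-order
    estimate built only from rollout costs J(.) (one common variant of
    the smoothing procedure the abstract mentions)."""
    g = np.zeros_like(K)
    for _ in range(n_samples):
        U = np.random.randn(*K.shape)   # random perturbation direction
        g += J(K + r * U) * U           # cost-weighted direction
    return g / (r * n_samples)
```

In the model-free variant, the analytic gradient above would be replaced by a call such as smoothed_gradient(J, K), where J is a rollout-cost oracle; the weight lam then trades off LQR performance against the level of structure in K.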

Updated: 2020-07-14