Policy Iteration Q-Learning for Data-Based Two-Player Zero-Sum Game of Linear Discrete-Time Systems
IEEE Transactions on Cybernetics (IF 11.8) Pub Date: 2020-02-20, DOI: 10.1109/tcyb.2020.2970969
Biao Luo, Yin Yang, Derong Liu

In this article, the data-based two-player zero-sum game problem is considered for linear discrete-time systems. Solving this problem theoretically depends on the discrete-time game algebraic Riccati equation (DTGARE), which requires complete knowledge of the system dynamics. To avoid solving the DTGARE, the $Q$-function is introduced and a data-based policy iteration $Q$-learning (PIQL) algorithm is developed to learn the optimal $Q$-function from data collected from the real system. By writing the $Q$-function in a quadratic form, it is proved, using the Fréchet derivative, that the PIQL algorithm is equivalent to the Newton iteration method in a Banach space. The convergence of the PIQL algorithm is then guaranteed by Kantorovich's theorem. For the realization of the PIQL algorithm, an off-policy learning scheme is proposed that uses real data rather than the system model. Finally, the effectiveness of the developed data-based PIQL method is validated through simulation studies.
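To make the abstract's ingredients concrete: with state $x_k$, control $u_k$, and disturbance $w_k$, a quadratic $Q$-function takes the form $Q(x_k,u_k,w_k)=z_k^\top H z_k$ with $z_k=[x_k^\top\ u_k^\top\ w_k^\top]^\top$. The sketch below illustrates one way such a PIQL loop can be realized from data: policy evaluation solves the $Q$-function Bellman equation for the kernel $H$ by least squares over a batch of exploratory (off-policy) data, and policy improvement takes the saddle point of the quadratic form. This is a minimal illustration under assumed conventions, not the paper's implementation; the system matrices, dimensions, cost weights, and gain parameterizations ($u=-Kx$, $w=Lx$) are hypothetical placeholders rather than the paper's simulation example.

```python
import numpy as np

# Hypothetical plant (NOT the paper's example): x_{k+1} = A x_k + B u_k + E w_k
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
E = np.array([[0.1],
              [0.0]])
n, m, q = 2, 1, 1
Qx, R, gamma = np.eye(n), np.eye(m), 5.0   # stage cost: x'Qx x + u'R u - gamma^2 w'w

rng = np.random.default_rng(0)

# Collect one batch of off-policy data with exploratory inputs (behavior policy).
# The model is used only to simulate data collection, not inside the learning loop.
N = 400
X = rng.normal(size=(N, n))
U = rng.normal(size=(N, m))
W = rng.normal(size=(N, q))
Xn = X @ A.T + U @ B.T + W @ E.T                      # successor states x_{k+1}
Rcost = (np.einsum('ij,jk,ik->i', X, Qx, X)
         + np.einsum('ij,jk,ik->i', U, R, U)
         - gamma**2 * np.einsum('ij,ij->i', W, W))

K = np.zeros((m, n))   # initial control gain, u = -K x (admissible: A is stable)
L = np.zeros((q, n))   # initial disturbance gain, w = L x

for it in range(30):
    # Policy evaluation: solve z'Hz - z1'Hz1 = r for vec(H) by least squares.
    # z uses the recorded behavior inputs; z1 applies the current target policies
    # at the successor state, which is what makes the scheme off-policy.
    Z  = np.hstack([X, U, W])
    Z1 = np.hstack([Xn, -Xn @ K.T, Xn @ L.T])
    Phi = (np.einsum('ni,nj->nij', Z, Z)
           - np.einsum('ni,nj->nij', Z1, Z1)).reshape(N, -1)
    h, *_ = np.linalg.lstsq(Phi, Rcost, rcond=None)
    H = h.reshape(n + m + q, n + m + q)
    H = 0.5 * (H + H.T)                               # keep the symmetric part

    # Policy improvement: saddle point of the quadratic Q-function,
    # from the stationarity conditions dQ/du = 0 and dQ/dw = 0.
    Hux, Hwx = H[n:n+m, :n], H[n+m:, :n]
    G = np.block([[H[n:n+m, n:n+m], H[n:n+m, n+m:]],
                  [H[n+m:, n:n+m],  H[n+m:, n+m:]]])
    KL = np.linalg.solve(G, -np.vstack([Hux, Hwx]))   # [u; w] = KL @ x
    K_new, L_new = -KL[:m], KL[m:]
    if np.linalg.norm(K_new - K) + np.linalg.norm(L_new - L) < 1e-8:
        break                                          # gains have converged
    K, L = K_new, L_new

print("control gain K:", K)
print("disturbance gain L:", L)
```

Note that the target gains $K$ and $L$ enter only through the successor term Z1, so the same batch of exploratory data is reused at every iteration; this reuse is the practical appeal of an off-policy scheme of the kind the abstract describes.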
