Off-policy integral reinforcement learning algorithm in dealing with nonzero sum game for nonlinear distributed parameter systems
Transactions of the Institute of Measurement and Control (IF 1.8), Pub Date: 2020-07-06, DOI: 10.1177/0142331220932634
He Ren, Jing Dai, Huaguang Zhang, Kun Zhang

Benefitting from the technique of integral reinforcement learning (IRL), this paper effectively solves the nonzero-sum (NZS) game for distributed parameter systems when the system dynamics are unavailable. The Karhunen-Loève decomposition (KLD) is employed to convert the partial differential equation (PDE) system into a high-order ordinary differential equation (ODE) system. Moreover, the off-policy IRL technique is introduced to design the optimal strategies for the NZS game. To confirm that the presented algorithm converges to the optimal value functions, the traditional adaptive dynamic programming (ADP) method is first discussed; the equivalence between the traditional ADP method and the presented off-policy method is then proved. To implement the presented off-policy IRL method, actor and critic neural networks are utilized to approximate the control strategies and value functions, respectively, during the iteration process. Finally, a numerical simulation illustrates the effectiveness of the proposed off-policy algorithm.
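The abstract gives no implementation details, so the following is a minimal sketch, under assumed data, of how a Karhunen-Loève (proper orthogonal decomposition) basis is typically extracted via SVD and used to reduce a PDE to a low-order ODE system. The grid sizes, the 99% energy threshold, and the heat-equation-like snapshot field are all illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic snapshot matrix: each column is the PDE state sampled on a
# spatial grid at one time instant (a heat-equation-like field; purely
# illustrative, not the paper's benchmark system).
n_space, n_snap = 101, 200
xg = np.linspace(0.0, 1.0, n_space)
tg = np.linspace(0.0, 2.0, n_snap)
U = (np.outer(np.sin(np.pi * xg), np.exp(-tg))
     + 0.3 * np.outer(np.sin(2.0 * np.pi * xg), np.exp(-4.0 * tg))
     + 0.01 * rng.standard_normal((n_space, n_snap)))

# Karhunen-Loève decomposition via SVD: the leading left singular vectors
# are the empirical eigenfunctions (KL modes) of the snapshot ensemble.
Phi, svals, _ = np.linalg.svd(U, full_matrices=False)

# Truncate to the r dominant modes capturing (say) 99% of the "energy"
# measured by the squared singular values.
energy = np.cumsum(svals**2) / np.sum(svals**2)
r = int(np.searchsorted(energy, 0.99)) + 1
Phi_r = Phi[:, :r]

# Galerkin projection: u(x, t) ~= Phi_r @ a(t), so the PDE reduces to an
# r-th order ODE system in the modal coordinates a(t). Here we only
# recover the modal coefficients of the recorded snapshots.
a = Phi_r.T @ U
print(f"retained modes: r = {r}, reconstruction error = "
      f"{np.linalg.norm(U - Phi_r @ a) / np.linalg.norm(U):.2e}")
```

The off-policy IRL step itself is easiest to see in a scalar, single-player, linear-quadratic special case rather than the paper's two-player NZS game. The sketch below, patterned on standard off-policy IRL for continuous-time LQR, collects trajectory data once under a fixed behavior policy and then iterates the policy evaluation/improvement regression on the stored data, without knowing the plant parameters. All gains, costs, horizons, and probing signals are assumptions for illustration, not values from the paper.

```python
import numpy as np

# --- data collection: one pass under a fixed behavior policy ---------------
a_true, b_true = 1.0, 1.0     # plant dx/dt = a*x + b*u, unknown to the learner
q, r = 1.0, 1.0               # running cost: q*x^2 + r*u^2
k_b = 1.5                     # behavior gain (stabilizing, otherwise arbitrary)
dt, steps, n_win = 1e-3, 500, 40
rng = np.random.default_rng(1)

data = []                     # per-window integrals, reused at every iteration
for _ in range(n_win):
    x = rng.uniform(-2.0, 2.0)
    omega = rng.uniform(1.0, 10.0)             # per-window probing frequency
    x0, I_xx, I_ux = x, 0.0, 0.0
    for s in range(steps):
        u = -k_b * x + 2.0 * np.sin(omega * s * dt)  # behavior input + probing
        I_xx += x * x * dt
        I_ux += u * x * dt
        x += (a_true * x + b_true * u) * dt    # Euler step (data generation only)
    data.append((x0 * x0, x * x, I_xx, I_ux))

# --- off-policy IRL policy iteration on the stored data --------------------
k = 2.0                       # initial stabilizing target-policy gain
for _ in range(10):
    # For each window, the IRL Bellman identity is linear in (p, k_next):
    #   p*(xT^2 - x0^2) - 2*r*k_next*(I_ux + k*I_xx) = -(q + r*k^2)*I_xx
    A = [[xT2 - x02, -2.0 * r * (I_ux + k * I_xx)]
         for x02, xT2, I_xx, I_ux in data]
    b = [-(q + r * k * k) * I_xx for _, _, I_xx, _ in data]
    p, k = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)[0]

# Closed-form optimal gain from the scalar Riccati equation, for reference:
p_star = r * (a_true + np.sqrt(a_true**2 + b_true**2 * q / r)) / b_true**2
print(f"learned k = {k:.4f}   optimal k* = {b_true * p_star / r:.4f}")
```

Because the regression reuses the same stored integrals at every iteration, no new data need to be collected when the target policy changes; this data-reuse property is what distinguishes the off-policy formulation from on-policy ADP, which the paper's two-player method exploits in the same spirit.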

Updated: 2020-07-06