Inverse Reinforcement Q-Learning Through Expert Imitation for Discrete-Time Systems
IEEE Transactions on Neural Networks and Learning Systems (IF 10.4) Pub Date: 2021-09-14, DOI: 10.1109/tnnls.2021.3106635
Wenqian Xue, Bosen Lian, Jialu Fan, Patrik Kolaric, Tianyou Chai, Frank L. Lewis

In inverse reinforcement learning (RL), there are two agents. An expert target agent has a performance cost function and exhibits control and state behaviors to a learner. The learner agent does not know the expert’s performance cost function but seeks to reconstruct it by observing the expert’s behaviors and to imitate these behaviors optimally through its own response. In this article, we formulate an imitation problem in which the optimal performance intent of a discrete-time (DT) expert target agent is unknown to a DT learner agent. Using only the observed expert behavior trajectory, the learner seeks to determine a cost function that yields the same optimal feedback gain as the expert’s, and thus imitates the expert’s optimal response. We develop an inverse RL approach with a new scheme to solve this behavior imitation problem. The approach consists of a cost function update, based on an extension of RL policy iteration and inverse optimal control, and a control policy update based on optimal control. Under this scheme, we then develop an inverse reinforcement Q-learning algorithm, an extension of RL Q-learning that requires no knowledge of the agent dynamics. Proofs of stability, convergence, and optimality are given, and a key property concerning the nonuniqueness of the recovered cost function is established. Finally, simulation experiments demonstrate the effectiveness of the new approach.
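The algorithm described in the article is model-free and data-driven. Purely as an illustration of the imitation objective and of the nonuniqueness property mentioned above, the following minimal sketch sets up a discrete-time linear-quadratic example. It is not the paper's inverse reinforcement Q-learning algorithm: the system matrices A and B, the expert weights Q_exp and R_exp, the exploration-noise level, and the helper lqr_gain are all illustrative assumptions, and the learner here merely estimates the expert's feedback gain from observed trajectories and checks which costs reproduce it.

# Minimal DT linear-quadratic illustration (assumed setup, not the paper's algorithm).
import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(0)

# Illustrative DT system and expert cost weights (assumptions for this sketch).
A = np.array([[0.95, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])
Q_exp = np.diag([2.0, 1.0])
R_exp = np.array([[0.5]])

def lqr_gain(A, B, Q, R):
    # Optimal feedback gain u = -K x for the DT LQR problem via the Riccati equation.
    P = solve_discrete_are(A, B, Q, R)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

K_exp = lqr_gain(A, B, Q_exp, R_exp)

# The expert demonstrates state/control behaviors; the learner only observes them.
x = np.array([1.0, -1.0])
states, controls = [], []
for _ in range(200):
    u = -K_exp @ x + 0.01 * rng.standard_normal(1)  # small exploration noise
    states.append(x.copy()); controls.append(u.copy())
    x = A @ x + (B @ u).ravel()

X = np.array(states)
U = np.array(controls)

# Learner step 1: estimate the expert's feedback gain from data by least squares.
K_hat = -np.linalg.lstsq(X, U, rcond=None)[0].T

# Learner step 2: check candidate costs against the estimated gain. Any positive
# scaling of (Q_exp, R_exp) yields the same optimal gain, so the cost recovered
# from behavior alone is nonunique even though the imitated response is identical.
for alpha in (1.0, 3.0, 10.0):
    K_alpha = lqr_gain(A, B, alpha * Q_exp, alpha * R_exp)
    print(alpha, np.linalg.norm(K_alpha - K_hat))

In this toy setting the printed gain errors stay near the noise level for every scaling alpha, which is one way to see why the article must characterize the family of nonunique cost solutions rather than a single recovered cost.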

Updated: 2021-09-14