Prioritizing Useful Experience Replay for Heuristic Dynamic Programming-Based Learning Systems
IEEE Transactions on Cybernetics (IF 9.4) Pub Date: 2019-07-19, DOI: 10.1109/tcyb.2018.2853582
Zhen Ni, Naresh Malla, Xiangnan Zhong

Adaptive dynamic programming controllers usually require a long training period because samples are discarded once used, which makes data usage relatively inefficient. Prioritized experience replay (ER) promotes important experiences and learns the control process more efficiently. This paper proposes integrating the efficient learning capability of prioritized ER into heuristic dynamic programming (HDP). First, a one-time-step-backward state-action pair is used to design the ER tuple, which avoids the need for a model network. Second, a systematic approach is proposed to integrate ER into both the critic and action networks of the HDP controller design. The proposed approach is tested on two case studies: a cart-pole balancing task and a triple-link pendulum balancing task. For a fair comparison, traditional HDP and the proposed approach are given the same initial weight parameters and initial starting states in the same simulation environment. Compared with traditional HDP, the proposed approach reduces the average number of trials required to succeed by 60.56% for the cart-pole task and by 56.89% for the triple-link balancing task. Results for ER-based HDP are also included for comparison. Moreover, a theoretical convergence analysis is presented to guarantee the stability of the proposed control design.
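To make the replay mechanism described in the abstract concrete, below is a minimal Python sketch, not the authors' implementation: a priority-ordered buffer of one-step-backward tuples (x_{t-1}, u_{t-1}, x_t, r_t), with the absolute temporal-difference error as the priority, and a critic replay update in the usual HDP form. The class names, the eviction and sampling rules, and the linear critic are assumptions made purely for illustration; the paper itself uses neural-network critic and action networks.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Hypothetical priority-ordered buffer of one-step-backward tuples
    (x_prev, u_prev, x, r); priority = |TD error| (an assumed rule)."""

    def __init__(self, capacity=500):
        self.capacity = capacity
        self.tuples = []      # stored experiences
        self.priorities = []  # one priority per stored tuple

    def add(self, x_prev, u_prev, x, r, td_error):
        if len(self.tuples) >= self.capacity:
            # evict the least useful (lowest-priority) experience
            idx = int(np.argmin(self.priorities))
            self.tuples.pop(idx)
            self.priorities.pop(idx)
        self.tuples.append((x_prev, u_prev, x, r))
        self.priorities.append(abs(td_error))

    def sample(self, batch_size=8):
        # sample proportionally to priority (one common scheme)
        p = np.asarray(self.priorities, dtype=float) + 1e-6
        p /= p.sum()
        n = min(batch_size, len(self.tuples))
        idx = np.random.choice(len(self.tuples), size=n, replace=False, p=p)
        return [self.tuples[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        # refresh priorities of replayed tuples with their new TD errors
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(e)


def critic_replay_step(critic_w, buffer, alpha=0.95, lr=0.01):
    """One HDP-style critic update from replayed tuples (a sketch).

    The critic here is a linear value function J(x) = w^T x purely for
    illustration. The TD error follows the standard HDP form
    e = alpha * J(x_t) - (J(x_{t-1}) - r_t).
    """
    batch, idx = buffer.sample()
    errors = []
    for x_prev, u_prev, x, r in batch:
        e = alpha * critic_w @ x - (critic_w @ x_prev - r)
        # gradient descent on 0.5 * e^2 with respect to w
        critic_w -= lr * e * (alpha * x - x_prev)
        errors.append(e)
    buffer.update_priorities(idx, errors)
    return critic_w
```

The sketch only shows the critic side; per the abstract, the proposed design replays the same prioritized tuples into the action-network update as well, and because the stored tuple already pairs the previous state-action with the current state, no model network is needed to generate the next state.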

Updated: 2024-08-22