Dynamic regret convergence analysis and an adaptive regularization algorithm for on-policy robot imitation learning
The International Journal of Robotics Research (IF 7.5) | Pub Date: 2021-01-24 | DOI: 10.1177/0278364920985879
Jonathan N. Lee, Michael Laskey, Ajay Kumar Tanwani, Anil Aswani, Ken Goldberg

On-policy imitation learning algorithms such as DAgger evolve a robot control policy by executing it, measuring performance (loss), obtaining corrective feedback from a supervisor, and generating the next policy. As the loss between iterations can vary unpredictably, a fundamental question is under what conditions this process will eventually achieve a converged policy. If one assumes the underlying trajectory distribution is static (stationary), it is possible to prove convergence for DAgger. However, in more realistic models for robotics, the underlying trajectory distribution is dynamic because it is a function of the policy. Recent results show it is possible to prove convergence of DAgger when a regularity condition on the rate of change of the trajectory distributions is satisfied. In this article, we reframe this result using dynamic regret theory from the field of online optimization and show that dynamic regret can be applied to any on-policy algorithm to analyze its convergence and optimality. These results inspire a new algorithm, Adaptive On-Policy Regularization (Aor), that ensures the conditions for convergence. We present simulation results with cart–pole balancing and locomotion benchmarks that suggest Aor can significantly decrease dynamic regret and chattering as the robot learns. To the best of the authors’ knowledge, this is the first application of dynamic regret theory to imitation learning.
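In standard dynamic-regret terms, each iteration t of the loop described above defines a loss f_t (the imitation loss of policy π_t on the trajectory distribution it induces), and dynamic regret compares the accumulated losses against the per-iteration minimizers, R_T = Σ_{t=1..T} f_t(π_t) - Σ_{t=1..T} min_π f_t(π), so convergence of the on-policy process corresponds to R_T growing sub-linearly in T. The sketch below is a minimal, hypothetical DAgger-style loop with the per-iteration loss bookkeeping such an analysis needs; env_rollout, supervisor, and fit_policy are assumed callables, the squared-error loss is an arbitrary illustrative choice, and this is not the authors' Aor implementation.

```python
import numpy as np

def on_policy_imitation(env_rollout, supervisor, fit_policy, policy, n_iters=20):
    """Illustrative DAgger-style on-policy loop (sketch only, not the paper's Aor algorithm).

    env_rollout(policy) -> list of states visited while executing the policy
    supervisor(state)   -> corrective action label for a single state
    fit_policy(data)    -> new policy trained on the aggregated (state, action) pairs
    """
    data, losses = [], []
    for _ in range(n_iters):
        states = env_rollout(policy)                 # execute the current policy
        labels = [supervisor(s) for s in states]     # corrective feedback from the supervisor
        # Imitation loss of the current policy on its own trajectory distribution, i.e. f_t(pi_t).
        losses.append(float(np.mean([np.sum((np.asarray(policy(s)) - np.asarray(a)) ** 2)
                                     for s, a in zip(states, labels)])))
        data.extend(zip(states, labels))             # aggregate demonstrations across iterations
        policy = fit_policy(data)                    # generate the next policy
    return policy, losses
```

The returned losses sequence is the Σ_t f_t(π_t) term of the dynamic regret above. An adaptive-regularization scheme in the spirit of Aor would, roughly, regularize each policy update toward the previous policy and adapt that regularization so the induced trajectory distribution satisfies the rate-of-change condition the convergence analysis requires.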



Updated: 2021-01-25