Reward Learning From Very Few Demonstrations
IEEE Transactions on Robotics (IF 9.4) Pub Date: 2020-12-07, DOI: 10.1109/tro.2020.3038698
Cem Eteke, Doğançan Kebude, Barış Akgün

This article introduces a novel skill learning framework that learns rewards from very few demonstrations and uses them in policy search (PS) to improve the skill. The demonstrations are used to learn both a parameterized policy to execute the skill and a goal model, in the form of a hidden Markov model (HMM), to monitor executions. Rewards are learned from the HMM structure and its monitoring capability: the HMM is converted to a finite-horizon Markov reward process (MRP), whose state values are estimated with a Monte Carlo approach. The HMM and these values are then merged into a partially observable MRP to obtain execution returns, which are used with PS to improve the policy. In addition to reward learning, a black-box PS method with an adaptive exploration strategy is adopted. The resulting framework is evaluated with five PS approaches and two skills in simulation. The results show that the learned dense rewards lead to better performance than sparse monitoring signals, and that adaptive exploration leads to faster convergence, higher success rates, and lower variance. The efficacy of the framework is validated in a real-robot setting by improving three skills from complete failure to complete success using the learned rewards, where sparse rewards failed entirely.
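The Monte Carlo value computation for the finite-horizon MRP can be illustrated with a minimal sketch. This is not the authors' implementation: the 3-state transition matrix `P`, the per-state rewards `r`, and the horizon are hypothetical stand-ins for what would be derived from the learned HMM, and the estimator simply averages rollout returns from each start state.

```python
import numpy as np

# Hypothetical 3-state finite-horizon MRP. In the paper's framework,
# P would come from the HMM's transition matrix and r from the learned
# rewards; the values here are illustrative only.
P = np.array([[0.8, 0.2, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing (e.g., goal reached)
r = np.array([0.0, 0.5, 1.0])     # per-state rewards (assumed)
H = 10                            # finite horizon

def mc_state_values(P, r, horizon, n_rollouts=5000, seed=0):
    """Estimate finite-horizon state values V(s) by averaging
    undiscounted returns over Monte Carlo rollouts of the MRP."""
    rng = np.random.default_rng(seed)
    n = len(r)
    V = np.zeros(n)
    for s0 in range(n):
        total = 0.0
        for _ in range(n_rollouts):
            s, ret = s0, r[s0]
            for _ in range(horizon - 1):
                s = rng.choice(n, p=P[s])  # sample next MRP state
                ret += r[s]
            total += ret
        V[s0] = total / n_rollouts
    return V

V = mc_state_values(P, r, H)
```

Under these assumed dynamics the estimated values increase toward the absorbing goal state, which is the property the framework exploits: states closer to task completion receive higher values, yielding a dense reward signal for policy search.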
