Learning reward functions from diverse sources of human feedback: Optimally integrating demonstrations and preferences
The International Journal of Robotics Research (IF 7.5), Pub Date: 2021-08-28, DOI: 10.1177/02783649211041652
Erdem Bıyık 1, Dylan P. Losey 2, Malayandi Palan 2, Nicholas C. Landolfi 2, Gleb Shevchuk 2, Dorsa Sadigh 1,2

Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to learn reward functions directly from human teachers. Importantly, data from human teachers can be collected either passively or actively in a variety of forms: passive data sources include demonstrations (e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings) are actively elicited. Prior research has applied reward learning to each of these data sources independently. However, there are many domains where multiple sources are complementary and expressive. Motivated by this general problem, we present a framework to integrate multiple sources of information that are either passively or actively collected from human users. In particular, we present an algorithm that first utilizes user demonstrations to initialize a belief about the reward function, and then actively probes the user with preference queries to zero in on their true reward. This algorithm not only enables us to combine multiple data sources, but it also informs the robot when it should leverage each type of information. Further, our approach accounts for the human's ability to provide data, yielding user-friendly preference queries that are also theoretically optimal. Our extensive simulated experiments and user studies on a Fetch mobile manipulator demonstrate the superiority and the usability of our integrated framework.
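To make the two-stage idea in the abstract concrete, the following is a minimal Python sketch, not the paper's implementation: it assumes a linear reward model w·φ over trajectory features, a Boltzmann-rational likelihood for demonstrations, a logistic likelihood for preference answers, a simple Metropolis-Hastings sampler for the belief over reward weights, and a crude uncertainty heuristic for choosing the next preference query. All function names and modeling choices here are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def demo_log_likelihood(w, demo_features, candidate_features, beta=1.0):
    # Boltzmann-rational demonstration model: the demonstrated trajectory is
    # exponentially more likely than alternative trajectories under reward w . phi.
    logits = beta * candidate_features @ w
    return beta * demo_features @ w - np.log(np.exp(logits).sum())

def preference_log_likelihood(w, feat_a, feat_b, answer, beta=1.0):
    # Logistic preference model; answer = 0 means the user preferred trajectory A.
    diff = beta * (feat_a - feat_b) @ w
    p_a = 1.0 / (1.0 + np.exp(-diff))
    return np.log(p_a if answer == 0 else 1.0 - p_a)

def sample_posterior(log_prob, dim, n_samples=500, n_burn=500, step=0.1):
    # Simple Metropolis-Hastings over unit-norm reward weights (a crude stand-in
    # for whatever sampler a real implementation would use).
    w = rng.normal(size=dim)
    w /= np.linalg.norm(w)
    samples = []
    for i in range(n_burn + n_samples):
        proposal = w + step * rng.normal(size=dim)
        proposal /= np.linalg.norm(proposal)
        if np.log(rng.uniform()) < log_prob(proposal) - log_prob(w):
            w = proposal
        if i >= n_burn:
            samples.append(w.copy())
    return np.array(samples)

def pick_query(samples, candidate_pairs):
    # Active querying heuristic: ask about the pair the current belief is most
    # unsure about (mean predicted preference closest to 50/50).
    best_pair, best_score = None, -np.inf
    for feat_a, feat_b in candidate_pairs:
        p_a = 1.0 / (1.0 + np.exp(-(feat_a - feat_b) @ samples.T))
        score = -abs(p_a.mean() - 0.5)
        if score > best_score:
            best_pair, best_score = (feat_a, feat_b), score
    return best_pair

# Toy usage: random feature vectors stand in for trajectory features phi.
dim = 4
demo_feat = rng.normal(size=dim)
candidates = rng.normal(size=(20, dim))

# Stage 1: initialize the belief over reward weights from a demonstration.
belief = sample_posterior(lambda w: demo_log_likelihood(w, demo_feat, candidates), dim)

# Stage 2: actively select a preference query and fold the answer into the belief.
pairs = [(candidates[i], candidates[j]) for i in range(5) for j in range(i + 1, 5)]
feat_a, feat_b = pick_query(belief, pairs)
answer = 0  # suppose the user says they prefer trajectory A
belief = sample_posterior(
    lambda w: demo_log_likelihood(w, demo_feat, candidates)
    + preference_log_likelihood(w, feat_a, feat_b, answer),
    dim,
)

In this sketch, the demonstration term anchors the belief before any queries are asked, and each preference answer is multiplied into the same posterior, which is the sense in which the two data sources are combined rather than used in isolation.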




Updated: 2021-08-29