Preference-based Learning of Reward Function Features
arXiv - CS - Robotics. Pub Date: 2021-03-03, DOI: arxiv-2103.02727
Sydney M. Katz, Amir Maleki, Erdem Bıyık, Mykel J. Kochenderfer

Preference-based learning of reward functions, where the reward function is learned from comparison data, has been well studied for complex robotic tasks such as autonomous driving. Existing algorithms have focused on learning reward functions that are linear in a set of trajectory features. The features are typically hand-coded, and preference-based learning is used to determine a particular user's relative weighting of each feature. Designing a representative set of features to encode reward is challenging and can result in inaccurate models that fail to capture the user's preferences or perform the task properly. In this paper, we present a method to learn both the relative weighting among features and additional features that help encode a user's reward function. The additional features are modeled as a neural network trained on data from pairwise comparison queries. We apply our method to a driving scenario used in previous work and compare its predictive power to that of hand-coded features alone. We perform additional analysis to interpret the learned features and examine the optimal trajectories. Our results show that adding a learned feature to the reward model enhances both its predictive power and expressiveness, producing unique results for each user.
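
To make the setup concrete, the sketch below shows one way a reward that is linear in hand-coded trajectory features plus a single neural-network feature could be fit to pairwise comparison data with a logistic (Bradley-Terry style) preference likelihood, which is standard in preference-based reward learning. The feature dimensions, network architecture, optimizer settings, and synthetic comparison data are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch: reward = w^T [hand-coded features, learned feature],
# trained on pairwise comparisons. Architecture and data are assumptions.

class LearnedFeature(nn.Module):
    """Small MLP mapping a trajectory summary to one additional feature."""
    def __init__(self, traj_dim, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, traj):
        return self.net(traj)

class PreferenceReward(nn.Module):
    """Reward linear in hand-coded features plus the learned feature."""
    def __init__(self, traj_dim, n_hand_features):
        super().__init__()
        self.learned = LearnedFeature(traj_dim)
        # Small random init so both w and the network receive gradients.
        self.w = nn.Parameter(0.1 * torch.randn(n_hand_features + 1))

    def forward(self, traj, hand_features):
        feats = torch.cat([hand_features, self.learned(traj)], dim=-1)
        return feats @ self.w

def preference_loss(model, traj_a, feat_a, traj_b, feat_b, prefs):
    """Logistic preference likelihood: P(A preferred) = sigmoid(R(A) - R(B))."""
    logits = model(traj_a, feat_a) - model(traj_b, feat_b)
    return nn.functional.binary_cross_entropy_with_logits(logits, prefs)

# Toy training loop on random pairwise-comparison data (illustrative only).
torch.manual_seed(0)
traj_dim, n_hand = 10, 4
model = PreferenceReward(traj_dim, n_hand)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
traj_a, traj_b = torch.randn(64, traj_dim), torch.randn(64, traj_dim)
feat_a, feat_b = torch.randn(64, n_hand), torch.randn(64, n_hand)
prefs = torch.randint(0, 2, (64,)).float()  # 1 if trajectory A preferred, else 0
for _ in range(200):
    opt.zero_grad()
    loss = preference_loss(model, traj_a, feat_a, traj_b, feat_b, prefs)
    loss.backward()
    opt.step()
```

In this sketch the feature weights and the network parameters are trained jointly; in practice the hand-coded features would be computed from the driving trajectories rather than sampled at random.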

Updated: 2021-03-05