Query-Policy Misalignment in Preference-Based Reinforcement Learning
arXiv - CS - Machine Learning Pub Date: 2023-05-27, DOI: arxiv-2305.17400
Xiao Hu, Jianxiong Li, Xianyuan Zhan, Qing-Shan Jia, Ya-Qin Zhang

Preference-based reinforcement learning (PbRL) provides a natural way to align RL agents' behavior with human desired outcomes, but is often restrained by costly human feedback. To improve feedback efficiency, most existing PbRL methods focus on selecting queries to maximally improve the overall quality of the reward model, but counter-intuitively, we find that this may not necessarily lead to improved performance. To unravel this mystery, we identify a long-neglected issue in the query selection schemes of existing PbRL studies: Query-Policy Misalignment. We show that the seemingly informative queries selected to improve the overall quality of reward model actually may not align with RL agents' interests, thus offering little help on policy learning and eventually resulting in poor feedback efficiency. We show that this issue can be effectively addressed via near on-policy query and a specially designed hybrid experience replay, which together enforce the bidirectional query-policy alignment. Simple yet elegant, our method can be easily incorporated into existing approaches by changing only a few lines of code. We showcase in comprehensive experiments that our method achieves substantial gains in both human feedback and RL sample efficiency, demonstrating the importance of addressing query-policy misalignment in PbRL tasks.
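The abstract names two concrete mechanisms: near on-policy query selection and a hybrid experience replay. The sketch below is only an illustration of how such mechanisms could be wired together, not the authors' implementation; every identifier (HybridReplayBuffer, select_near_onpolicy_queries, disagreement_fn, recent_frac) is a hypothetical name introduced here for clarity.

import random
from collections import deque

class HybridReplayBuffer:
    """Hybrid replay: a large FIFO buffer plus a small window of recent transitions."""
    def __init__(self, capacity=100_000, recent_size=5_000):
        self.all = deque(maxlen=capacity)
        self.recent = deque(maxlen=recent_size)

    def add(self, transition):
        self.all.append(transition)
        self.recent.append(transition)

    def sample(self, batch_size, recent_frac=0.5):
        # Draw a fixed fraction from recent (near on-policy) experience,
        # and the remainder uniformly from the whole buffer.
        n_recent = int(batch_size * recent_frac)
        batch = random.sample(list(self.recent), min(n_recent, len(self.recent)))
        remaining = batch_size - len(batch)
        batch += random.sample(list(self.all), min(remaining, len(self.all)))
        return batch

def select_near_onpolicy_queries(recent_segments, num_queries, disagreement_fn):
    """Form candidate query pairs only from recent segments (near on-policy),
    then keep the pairs on which the reward-model ensemble disagrees most."""
    candidates = [(random.choice(recent_segments), random.choice(recent_segments))
                  for _ in range(10 * num_queries)]
    candidates.sort(key=lambda pair: disagreement_fn(*pair), reverse=True)
    return candidates[:num_queries]

The intended effect, under these assumptions, is bidirectional alignment: human labels are requested on behavior the current policy actually produces, and policy updates emphasize recent data, so the learned reward and the policy improve on overlapping state distributions.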

Updated: 2023-05-30