Generative Adversarial Reward Learning for Generalized Behavior Tendency Inference
arXiv - CS - Information Retrieval. Pub Date: 2021-05-03, DOI: arxiv-2105.00822
Xiaocong Chen, Lina Yao, Xianzhi Wang, Aixin Sun, Wenjie Zhang, Quan Z. Sheng

Recent advances in reinforcement learning have inspired increasing interest in adaptively learning user models through dynamic interactions, e.g., in reinforcement-learning-based recommender systems. The reward function is crucial for most reinforcement learning applications, as it guides the optimization. However, current reinforcement-learning-based methods rely on manually defined reward functions, which cannot adapt to dynamic and noisy environments. Moreover, they generally use task-specific reward functions that sacrifice generalization ability. To address these issues, we propose a generative inverse reinforcement learning approach for user behavioral preference modelling. Instead of relying on a predefined reward function, our model automatically learns rewards from the user's actions based on a discriminative actor-critic network and a Wasserstein GAN. Our model provides a general way of characterizing and explaining underlying behavioral tendencies, and our experiments show that it outperforms state-of-the-art methods in a variety of scenarios, namely traffic signal control, online recommender systems, and scanpath prediction.
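The adversarial reward-learning idea sketched in the abstract can be read as alternating two updates: a Wasserstein-style discriminator is trained to separate logged user (expert) behaviour from behaviour generated by the current policy, and its score is then used as the learned reward in an actor-critic policy update. The PyTorch sketch below illustrates this loop under simplifying assumptions only; the network sizes, the random placeholder data, and all hyper-parameters are illustrative and are not the paper's configuration.

```python
# A minimal sketch of adversarial reward learning with a Wasserstein-style critic.
# Everything here (architectures, toy data, hyper-parameters) is an assumption
# for illustration, not the authors' implementation.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HIDDEN = 8, 4, 64

class Discriminator(nn.Module):
    """Scores (state, action) pairs; its output serves as a learned reward."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class Actor(nn.Module):
    """Policy producing a categorical distribution over discrete actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, ACTION_DIM))
    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

class CriticV(nn.Module):
    """State-value baseline for the actor-critic update."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, 1))
    def forward(self, s):
        return self.net(s).squeeze(-1)

disc, actor, critic = Discriminator(), Actor(), CriticV()
opt_d = torch.optim.RMSprop(disc.parameters(), lr=5e-5)
opt_ac = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def one_hot(a):
    return torch.nn.functional.one_hot(a, ACTION_DIM).float()

for _ in range(200):
    # Placeholder batches; in practice the expert batch comes from logged user
    # behaviour and the policy batch from rolling out the current policy.
    expert_s = torch.randn(32, STATE_DIM)
    expert_a = torch.randint(0, ACTION_DIM, (32,))
    policy_s = torch.randn(32, STATE_DIM)
    with torch.no_grad():
        policy_a = actor(policy_s).sample()

    # Wasserstein-style discriminator update: raise expert scores, lower policy
    # scores, with weight clipping as in the original WGAN.
    d_loss = (disc(policy_s, one_hot(policy_a)).mean()
              - disc(expert_s, one_hot(expert_a)).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    with torch.no_grad():
        for p in disc.parameters():
            p.clamp_(-0.01, 0.01)

    # Actor-critic update using the discriminator score as the learned reward.
    with torch.no_grad():
        reward = disc(policy_s, one_hot(policy_a)).squeeze(-1)
    value = critic(policy_s)
    advantage = reward - value
    log_prob = actor(policy_s).log_prob(policy_a)
    ac_loss = -(log_prob * advantage.detach()).mean() + advantage.pow(2).mean()
    opt_ac.zero_grad(); ac_loss.backward(); opt_ac.step()
```

In this sketch the discriminator replaces a hand-crafted reward: as it learns to separate observed user behaviour from policy behaviour, its score pushes the policy toward the users' underlying behavioural tendencies, which is the mechanism the abstract describes.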

Updated: 2021-05-04