Partially observable environment estimation with uplift inference for reinforcement learning based recommendation
Machine Learning (IF 4.3) Pub Date: 2021-04-14, DOI: 10.1007/s10994-021-05969-w
Wenjie Shang, Qingyang Li, Zhiwei Qin, Yang Yu, Yiping Meng, Jieping Ye

Reinforcement learning (RL) aims to search for the best policy model for decision making and has proven powerful for sequential recommendation. Training the policy with RL, however, requires an environment to interact with. In many real-world applications, training the policy in the real environment can incur an unbearable cost due to exploration. Estimating an environment from past data is thus an appealing way to unleash the power of RL in these applications. Estimating the environment essentially means extracting the causal effect model from the data. However, real-world applications are often too complex to offer fully observable environment information, so unobserved variables quite possibly lie behind the data and can obstruct an effective estimation of the environment. In this paper, by treating the hidden variables as a hidden policy, we propose a partially-observed multi-agent environment estimation (POMEE) approach to learn the partially-observed environment. To better extract the causal relationship between actions and rewards, we design a deep uplift inference network (DUIN) model to learn the causal effects of different actions. By implementing the environment model with the DUIN structure, we propose a POMEE with uplift inference (POMEE-UI) approach to generate a partially-observed environment with a causal reward mechanism. We analyze the effect of our method in both artificial and real-world environments. We first use an artificial recommender environment, abstracted from a real-world application, to verify the effectiveness of POMEE-UI. We then test POMEE-UI in the real application of Didi Chuxing. Experimental results show that POMEE-UI can effectively estimate the hidden variables, leading to a more reliable virtual environment. Online A/B testing results show that POMEE can derive a well-performing recommender policy in the real-world application.
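The uplift inference idea in the abstract, learning the causal effect of each candidate action on the reward, can be sketched roughly as a shared state encoder with one reward head per action, where the uplift of an action is the difference between its predicted reward and that of a baseline action. The sketch below is a minimal illustration under these assumptions, not the authors' DUIN architecture; all names (UpliftRewardNet, state_dim, n_actions, train_step) are hypothetical.

```python
# Minimal uplift-style reward model: shared encoder, one reward head per action.
# Illustrative sketch only; it is not the DUIN model described in the paper.
import torch
import torch.nn as nn

class UpliftRewardNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One reward head per candidate action (multi-head / T-learner style).
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_actions)])

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        z = self.encoder(state)
        # Predicted reward for every action: shape (batch, n_actions).
        return torch.cat([head(z) for head in self.heads], dim=-1)

    def uplift(self, state: torch.Tensor, action: int, baseline: int = 0) -> torch.Tensor:
        # Estimated causal effect of `action` relative to a baseline action.
        rewards = self.forward(state)
        return rewards[:, action] - rewards[:, baseline]

def train_step(model, optimizer, states, actions, rewards):
    # Only the head of the logged action receives a gradient, so each head is
    # fit on its own "treatment group" within the logged data.
    optimizer.zero_grad()
    preds = model(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(preds, rewards)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Fitting one head per action lets the model compare counterfactual rewards for the same state, which is the uplift quantity the abstract refers to; a learned environment can then use such per-action reward estimates as its causal reward mechanism.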



Updated: 2021-04-15