Optimal Mixture Weights for Off-Policy Evaluation with Multiple Behavior Policies
arXiv - CS - Machine Learning Pub Date : 2020-11-29 , DOI: arxiv-2011.14359 Jinlin Lai, Lixin Zou, Jiaxing Song
Off-policy evaluation is a key component of reinforcement learning: it
evaluates a target policy using offline data collected from behavior policies.
It is a crucial step towards safe reinforcement learning and has been used in
advertising, recommender systems, and many other applications. In these
applications, the offline data is sometimes collected from multiple behavior
policies. Previous works treat data from different behavior policies equally,
yet some behavior policies are better at producing good estimators than
others. This paper begins by discussing how to correctly mix the estimators
produced by different behavior policies. We propose three ways to reduce the
variance of the mixture estimator when all sub-estimators are unbiased or
asymptotically unbiased. Experiments on simulated recommender systems show
that our methods are effective in reducing the mean-squared error of
estimation.
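To illustrate the idea of weighting sub-estimators rather than averaging them uniformly, here is a minimal sketch (not the paper's actual method; the setup and numbers are hypothetical) of the classical inverse-variance weighting scheme, which minimizes the variance of a mixture of independent unbiased estimators:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: K behavior policies each yield an unbiased estimate of
# the same target value v_true, but with different variances.
v_true = 1.0
variances = np.array([0.1, 1.0, 4.0])  # per-policy estimator variances
n_trials = 20000

# Draw one estimate per policy per trial.
estimates = rng.normal(v_true, np.sqrt(variances),
                       size=(n_trials, len(variances)))

# Uniform mixture: treat all behavior policies equally.
uniform = estimates.mean(axis=1)

# Inverse-variance mixture: the variance-minimizing weights for independent
# unbiased sub-estimators, w_i proportional to 1 / Var_i.
w = (1.0 / variances) / (1.0 / variances).sum()
weighted = estimates @ w

mse_uniform = np.mean((uniform - v_true) ** 2)
mse_weighted = np.mean((weighted - v_true) ** 2)
print(f"MSE uniform:  {mse_uniform:.4f}")
print(f"MSE weighted: {mse_weighted:.4f}")  # noticeably smaller
```

Since all sub-estimators are unbiased, the MSE here is just the variance: the uniform average has variance (0.1 + 1.0 + 4.0) / 9 ≈ 0.567, while the inverse-variance mixture achieves 1 / (10 + 1 + 0.25) ≈ 0.089. In practice the per-policy variances are unknown and must be estimated, which is part of what makes choosing mixture weights nontrivial.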
Updated: 2020-12-01