Optimal Mixture Weights for Off-Policy Evaluation with Multiple Behavior Policies
arXiv - CS - Machine Learning Pub Date : 2020-11-29 , DOI: arxiv-2011.14359 Jinlin Lai, Lixin Zou, Jiaxing Song
Off-policy evaluation is a key component of reinforcement learning: it
evaluates a target policy using offline data collected from behavior policies.
It is a crucial step towards safe reinforcement learning and has been used in
advertising, recommender systems, and many other applications. In these
applications, the offline data is sometimes collected from multiple behavior
policies. Previous works treat data from different behavior policies equally,
yet some behavior policies are better at producing good estimators than
others. This paper begins by discussing how to correctly mix the estimators
produced by different behavior policies. We propose three ways to reduce the
variance of the mixture estimator when all sub-estimators are unbiased or
asymptotically unbiased. Experiments on simulated recommender systems show
that our methods are effective in reducing the mean-squared error of
estimation.
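To illustrate the idea of weighting sub-estimators rather than averaging them uniformly, here is a minimal sketch (not the paper's actual method; the setup and numbers are hypothetical) of the classical inverse-variance weighting scheme, which minimizes the variance of a mixture of independent unbiased estimators:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: K behavior policies each yield an unbiased estimate of
# the same target value v_true, but with different variances.
v_true = 1.0
variances = np.array([0.1, 1.0, 4.0])  # per-policy estimator variances
n_trials = 20000

# Draw one estimate per policy per trial.
estimates = rng.normal(v_true, np.sqrt(variances),
                       size=(n_trials, len(variances)))

# Uniform mixture: treat all behavior policies equally.
uniform = estimates.mean(axis=1)

# Inverse-variance mixture: the variance-minimizing weights for independent
# unbiased sub-estimators, w_i proportional to 1 / Var_i.
w = (1.0 / variances) / (1.0 / variances).sum()
weighted = estimates @ w

mse_uniform = np.mean((uniform - v_true) ** 2)
mse_weighted = np.mean((weighted - v_true) ** 2)
print(f"MSE uniform:  {mse_uniform:.4f}")
print(f"MSE weighted: {mse_weighted:.4f}")  # noticeably smaller
```

Since all sub-estimators are unbiased, the MSE here is just the variance: the uniform average has variance (0.1 + 1.0 + 4.0) / 9 ≈ 0.567, while the inverse-variance mixture achieves 1 / (10 + 1 + 0.25) ≈ 0.089. In practice the per-policy variances are unknown and must be estimated, which is part of what makes choosing mixture weights nontrivial.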
Updated: 2020-12-01