Safe Exploration for Optimizing Contextual Bandits
ACM Transactions on Information Systems (IF 5.6). Pub Date: 2020-05-04. DOI: 10.1145/3385670
Rolf Jagerman, Ilya Markov, Maarten de Rijke

Contextual bandit problems are a natural fit for many information retrieval tasks, such as learning to rank, text classification, recommendation, and so on. However, existing learning methods for contextual bandit problems have one of two drawbacks: They either do not explore the space of all possible document rankings (i.e., actions) and, thus, may miss the optimal ranking, or they present suboptimal rankings to a user and, thus, may harm the user experience. We introduce a new learning method for contextual bandit problems, Safe Exploration Algorithm (SEA), which overcomes the above drawbacks. SEA starts by using a baseline (or production) ranking system (i.e., policy), which does not harm the user experience and, thus, is safe to execute but has suboptimal performance and, thus, needs to be improved. Then SEA uses counterfactual learning to learn a new policy based on the behavior of the baseline policy. SEA also uses high-confidence off-policy evaluation to estimate the performance of the newly learned policy. Once the performance of the newly learned policy is at least as good as the performance of the baseline policy, SEA starts using the new policy to execute new actions, allowing it to actively explore favorable regions of the action space. This way, SEA never performs worse than the baseline policy and, thus, does not harm the user experience, while still exploring the action space and, thus, being able to find an optimal policy. Our experiments using text classification and document retrieval confirm the above by comparing SEA (and a boundless variant called BSEA) to online and offline learning methods for contextual bandit problems.
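To make the control flow described above concrete, the following is a minimal Python sketch of an SEA-style loop. It is an illustration under simplifying assumptions, not the paper's implementation: the linear softmax policy, the plain inverse-propensity-scoring (IPS) estimator, the Hoeffding-style lower confidence bound, and all names (SoftmaxPolicy, counterfactual_learn, etc.) are stand-ins chosen for brevity, whereas the paper specifies its own counterfactual learner and high-confidence off-policy evaluator.

```python
import numpy as np

rng = np.random.default_rng(0)

class SoftmaxPolicy:
    """Toy linear softmax policy over a small discrete action set."""
    def __init__(self, weights):
        self.W = weights  # shape: (n_actions, n_features)

    def probs(self, x):
        scores = self.W @ x
        e = np.exp(scores - scores.max())
        return e / e.sum()

    def prob(self, x, a):
        return self.probs(x)[a]

    def sample(self, x):
        p = self.probs(x)
        a = rng.choice(len(p), p=p)
        return a, p[a]  # chosen action and its logging propensity

def ips_value(policy, log):
    """Plain IPS estimate of a policy's value from logged data."""
    return float(np.mean([r * policy.prob(x, a) / p for (x, a, p, r) in log]))

def ips_lower_bound(policy, log, delta=0.05):
    """Hoeffding-style lower confidence bound on the IPS estimate,
    standing in for the paper's high-confidence off-policy evaluation."""
    vals = np.array([r * policy.prob(x, a) / p for (x, a, p, r) in log])
    return vals.mean() - np.sqrt(np.log(1.0 / delta) / (2 * len(vals)))

def counterfactual_learn(log, n_actions, n_features, steps=200, lr=0.5):
    """Learn a new policy from the baseline's logged behavior by gradient
    ascent on the IPS objective (a simple counterfactual learner)."""
    W = np.zeros((n_actions, n_features))
    for _ in range(steps):
        grad = np.zeros_like(W)
        for (x, a, p, r) in log:
            pi = SoftmaxPolicy(W).probs(x)
            # gradient of pi(a|x) w.r.t. W for a linear softmax policy
            grad += (r / p) * pi[a] * np.outer(np.eye(n_actions)[a] - pi, x)
        W += lr * grad / len(log)
    return SoftmaxPolicy(W)

# SEA-style control loop: act with the safe baseline, learn counterfactually
# from its logs, and switch only once the new policy's lower confidence
# bound is at least the baseline's estimated value.
n_actions, n_features = 3, 4
baseline = SoftmaxPolicy(rng.normal(size=(n_actions, n_features)))
deployed, log = baseline, []

for t in range(500):
    x = rng.normal(size=n_features)
    a, p = deployed.sample(x)
    r = float(a == int(x[0] > 0))  # toy reward: action 1 iff first feature > 0
    log.append((x, a, p, r))
    if (t + 1) % 100 == 0 and deployed is baseline:
        candidate = counterfactual_learn(log, n_actions, n_features)
        if ips_lower_bound(candidate, log) >= ips_value(baseline, log):
            deployed = candidate  # safe to deploy and explore with
```

The safety property in this sketch mirrors the abstract: the deployed policy only changes when the off-policy estimate says, with high confidence, that the candidate is at least as good as the baseline, so user-facing behavior never regresses below the baseline while the action space is still explored.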
