Distilled Thompson Sampling: Practical and Efficient Thompson Sampling via Imitation Learning
arXiv - CS - Machine Learning. Pub Date: 2020-11-29. DOI: arXiv:2011.14266
Hongseok Namkoong, Samuel Daulton, Eytan Bakshy

Thompson sampling (TS) has emerged as a robust technique for contextual bandit problems. However, TS requires posterior inference and optimization for action generation, prohibiting its use in many internet applications where latency and ease of deployment are of concern. We propose a novel imitation-learning-based algorithm that distills a TS policy into an explicit policy representation by performing posterior inference and optimization offline. The explicit policy representation enables fast online decision-making and easy deployment in mobile and server-based environments. Our algorithm iteratively performs offline batch updates to the TS policy and learns a new imitation policy. Since we update the TS policy with observations collected under the imitation policy, our algorithm emulates an off-policy version of TS. Our imitation algorithm guarantees Bayes regret comparable to TS, up to the sum of single-step imitation errors. We show these imitation errors can be made arbitrarily small when unlabeled contexts are cheaply available, which is the case for most large-scale internet applications. Empirically, we show that our imitation policy achieves comparable regret to TS, while reducing decision-time latency by over an order of magnitude. Our algorithm is deployed in video upload systems at Facebook and Instagram and is handling millions of uploads each day.
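The core idea above — run posterior inference and TS offline, then distill the resulting stochastic policy into a fast explicit policy using cheap unlabeled contexts — can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes a toy linear-Gaussian posterior over per-arm weights and uses a logistic-regression imitation policy as the explicit representation. All names (`ts_action`, `fast_action`, the toy sizes `d`, `K`) are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, K = 5, 3  # context dimension and number of arms (toy sizes)

# Assumed toy posterior: an independent Gaussian over each arm's reward
# weights, as would come from Bayesian linear regression on logged data.
post_mean = rng.normal(size=(K, d))
post_cov = np.stack([0.1 * np.eye(d) for _ in range(K)])

def ts_action(x, rng):
    """One Thompson-sampling decision: draw weights from the posterior
    for every arm, then act greedily on the sampled rewards."""
    sampled = np.stack(
        [rng.multivariate_normal(post_mean[k], post_cov[k]) for k in range(K)]
    )
    return int(np.argmax(sampled @ x))

# Offline distillation: label a large batch of cheap unlabeled contexts
# with TS draws, then fit an explicit imitation policy to those labels.
X = rng.normal(size=(5000, d))
y = np.array([ts_action(x, rng) for x in X])
imitation = LogisticRegression(max_iter=1000).fit(X, y)

def fast_action(x):
    """Online decision: a single cheap forward pass, with no posterior
    sampling or per-request optimization."""
    return int(imitation.predict(x[None, :])[0])
```

At serving time only `fast_action` runs, which is why the decision-time latency drops: the expensive sampling loop is confined to the offline labeling step, and more unlabeled contexts shrink the single-step imitation error.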

Updated: 2020-12-01