GuideBoot: Guided Bootstrap for Deep Contextual Bandits, arXiv - CS - Information Retrieval


arXiv - CS - Information Retrieval Pub Date : 2021-07-18 , DOI: arxiv-2107.08383
Feiyang Pan, Haoming Li, Xiang Ao, Wei Wang, Yanrong Kang, Ao Tan, Qing He

The exploration/exploitation (E&E) dilemma lies at the core of interactive systems such as online advertising, for which contextual bandit algorithms have been proposed. Bayesian approaches provide guided exploration with principled uncertainty estimation, but their applicability is often limited by over-simplified assumptions. Non-Bayesian bootstrap methods, on the other hand, can be applied to complex problems by using deep reward models, but lack clear guidance for the exploration behavior. Developing a practical method for complex deep contextual bandits remains largely unsolved. In this paper, we introduce Guided Bootstrap (GuideBoot for short), combining the best of both worlds. GuideBoot provides explicit guidance for the exploration behavior by training multiple models over both real samples and noisy samples with fake labels, where the noise is added according to the predictive uncertainty. The proposed method is efficient, as it can make decisions on the fly using only one randomly chosen model, and it is also effective, as we show that it can be viewed as a non-Bayesian approximation of Thompson sampling. Moreover, we extend it to an online version that can learn solely from streaming data, which is favored in real applications. Extensive experiments on both a synthetic task and large-scale advertising environments show that GuideBoot achieves significant improvements over previous state-of-the-art methods.
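The mechanism described in the abstract (several reward models, real samples mixed with fake-label samples in proportion to uncertainty, and decisions made by one randomly chosen model) can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several assumptions that the abstract does not specify: linear ridge-regression reward models stand in for deep models, ensemble disagreement stands in for predictive uncertainty, the mapping `sigma / (1 + sigma)` from uncertainty to fake-sample probability is invented for illustration, and Poisson(1) weights are used as a common online bootstrap. The class name `BootstrapEnsembleBandit` and all parameters are hypothetical.

```python
import numpy as np


class BootstrapEnsembleBandit:
    """Hypothetical sketch: an ensemble of ridge-regression reward models per arm."""

    def __init__(self, n_arms, dim, n_models=10, ridge=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_arms, self.dim, self.n_models = n_arms, dim, n_models
        # Ridge-regression sufficient statistics for every (model, arm) pair.
        self.A = np.tile(ridge * np.eye(dim), (n_models, n_arms, 1, 1))
        self.b = np.zeros((n_models, n_arms, dim))

    def predict(self, x):
        # theta[k, a] solves A[k, a] @ theta = b[k, a]; result shape (n_models, n_arms).
        theta = np.linalg.solve(self.A, self.b[..., None])[..., 0]
        return theta @ x

    def act(self, x):
        # Decide with a single randomly chosen model, then act greedily on it.
        k = self.rng.integers(self.n_models)
        return int(np.argmax(self.predict(x)[k]))

    def update(self, x, arm, reward):
        # Assumption: ensemble disagreement is used as the uncertainty estimate.
        sigma = self.predict(x)[:, arm].std()
        p_fake = sigma / (1.0 + sigma)  # assumed mapping from uncertainty to probability
        outer = np.outer(x, x)
        for k in range(self.n_models):
            # Online bootstrap: each model sees the real sample Poisson(1) times.
            w = self.rng.poisson(1.0)
            self.A[k, arm] += w * outer
            self.b[k, arm] += w * reward * x
            # The more uncertain the prediction, the more likely a fake-label sample is added.
            if self.rng.random() < p_fake:
                fake_label = self.rng.integers(0, 2)  # random binary label
                self.A[k, arm] += outer
                self.b[k, arm] += fake_label * x
```

In use, one would call `act(x)` on each incoming context vector, observe the reward for the chosen arm, and pass it to `update(x, arm, reward)`. Because each update touches only running statistics, the sketch learns from streaming data without storing past samples, which loosely mirrors the online setting the abstract mentions.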


Updated: 2021-07-20
