Contextual Bandits for adapting to changing User preferences over time
arXiv - CS - Machine Learning. Pub Date: 2020-09-21, DOI: arxiv-2009.10073
Dattaraj Rao

Contextual bandits provide an effective way to model dynamic-data problems in ML by leveraging online (incremental) learning to continuously adjust predictions as the environment changes. We explore contextual bandits, an extension of the traditional reinforcement learning (RL) problem, and build a novel algorithm that solves it using an array of action-based learners. We apply this approach to model an article recommendation system, using an array of stochastic gradient descent (SGD) learners to predict the reward of each action taken. We then extend the approach to the publicly available MovieLens dataset and explore the findings. First, we make available a simplified simulated dataset in which user preferences vary over time, and show how it can be evaluated with static and dynamic learning algorithms. This dataset, released as part of this research, is intentionally simulated with a limited number of features and can be used to evaluate different problem-solving strategies. We build a classifier on a static snapshot of the data and evaluate its performance on this dataset. We show the limitations of a static learner, whose context is fixed at a point in time, and how changing that context degrades its accuracy. Next, we develop a novel algorithm for solving the contextual bandit problem. Like linear bandits, this algorithm models the reward as a function of the context vector, but it uses an array of learners, with a separate SGD learner per arm, to capture variation between actions/arms. Finally, we apply this contextual bandit algorithm to predicting movie ratings over time by different users from the standard MovieLens dataset and demonstrate the results.
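To make the per-arm design concrete, the following is a minimal sketch in Python of the kind of algorithm the abstract describes, assuming scikit-learn's SGDRegressor as the incremental learner and epsilon-greedy exploration. The class name PerArmSGDBandit, the epsilon value, and the drifting-reward simulator in the demo loop are illustrative assumptions, not the paper's exact implementation.

# Minimal sketch: contextual bandit with one SGD learner per arm.
# Assumptions (not from the paper): scikit-learn's SGDRegressor as the
# incremental learner, epsilon-greedy exploration, and a synthetic
# drifting-reward environment for the demo.
import numpy as np
from sklearn.linear_model import SGDRegressor

class PerArmSGDBandit:
    def __init__(self, n_arms, n_features, epsilon=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.epsilon = epsilon
        # One incremental learner per arm; each maps a context vector
        # to a predicted reward for pulling that arm.
        self.learners = [SGDRegressor(learning_rate="constant", eta0=0.01)
                         for _ in range(n_arms)]
        # Warm-start each learner so predict() works before any feedback.
        dummy_x = np.zeros((1, n_features))
        for m in self.learners:
            m.partial_fit(dummy_x, [0.0])

    def select_arm(self, context):
        # Explore with probability epsilon; otherwise pick the arm whose
        # learner predicts the highest reward for this context.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(len(self.learners)))
        preds = [m.predict(context.reshape(1, -1))[0] for m in self.learners]
        return int(np.argmax(preds))

    def update(self, arm, context, reward):
        # Incrementally update only the pulled arm's learner, so each
        # arm's model tracks its own (possibly drifting) reward function.
        self.learners[arm].partial_fit(context.reshape(1, -1), [reward])

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    n_arms, n_features = 3, 5
    bandit = PerArmSGDBandit(n_arms, n_features)
    # Hypothetical environment: each arm's reward is linear in the context,
    # with weights that drift slowly to mimic changing user preferences.
    weights = rng.normal(size=(n_arms, n_features))
    for t in range(2000):
        ctx = rng.normal(size=n_features)
        weights += 0.001 * rng.normal(size=weights.shape)  # slow drift
        arm = bandit.select_arm(ctx)
        reward = weights[arm] @ ctx + 0.1 * rng.normal()
        bandit.update(arm, ctx, reward)

Because each arm keeps its own incrementally updated model, an arm whose reward function drifts is re-learned from its own recent feedback without disturbing the models for the other arms, which is what lets this design adapt where a statically trained classifier cannot.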

Updated: 2020-09-24