An Optimal Algorithm for the Stochastic Bandits While Knowing the Near-Optimal Mean Reward.
IEEE Transactions on Neural Networks and Learning Systems (IF 10.2) Pub Date: 2021-05-03, DOI: 10.1109/tnnls.2020.2995920
Shangdong Yang, Yang Gao

This brief studies a variation of the stochastic multiarmed bandit (MAB) problem in which the agent has a piece of a priori knowledge called the near-optimal mean reward (NoMR). In the common MAB problem, an agent tries to find the optimal arm without knowing the optimal mean reward. In many practical applications, however, the agent can obtain an estimate of the optimal mean reward, which we define as the NoMR. For instance, in an online Web advertising system based on MAB methods, a user's near-optimal average click rate (the NoMR) can be roughly estimated from his/her demographic characteristics. Exploiting the NoMR can therefore improve an algorithm's performance. First, we formalize the stochastic MAB problem in which the known NoMR lies between the suboptimal mean reward and the optimal mean reward. Second, using cumulative regret as the performance metric, we show that the lower bound on the cumulative regret of this problem is Ω(1/∆), where ∆ is the gap between the suboptimal mean reward and the optimal mean reward. In contrast to the conventional MAB problem, whose regret lower bound grows logarithmically, our lower bound is constant in the number of learning steps. Third, a novel algorithm, NoMR-BANDIT, is proposed to solve this problem; it uses the NoMR to design an efficient exploration strategy. We further analyze the regret upper bound of NoMR-BANDIT and show that it is also uniformly O(1/∆), matching the order of the lower bound; consequently, NoMR-BANDIT is an optimal algorithm for this problem. To improve the generality of our method, we propose CASCADE-BANDIT, built on NoMR-BANDIT, for the case where the NoMR is less than the suboptimal mean reward. CASCADE-BANDIT has a regret upper bound of O(∆ log n), where n denotes the number of learning steps, matching the order of conventional MAB methods. Finally, extensive experimental results demonstrate that NoMR-BANDIT is more efficient than the compared bandit solutions: after sufficient iterations, it incurs 10%-80% less cumulative regret than the state of the art.

Updated: 2020-06-01