An asymptotically optimal strategy for constrained multi-armed bandit problems
Mathematical Methods of Operations Research (IF 1.2) Pub Date: 2020-01-02, DOI: 10.1007/s00186-019-00697-3
Hyeong Soo Chang

This note considers the “constrained multi-armed bandit” (CMAB) model, which generalizes the classical stochastic MAB model by adding a feasibility constraint to each action. Feasibility is in effect a second (conflicting) objective that a playing strategy must satisfy in order to achieve optimality of the main objective. Just as the stochastic MAB model is a special case of the Markov decision process (MDP) model, the CMAB model is a special case of the constrained MDP model. For asymptotic optimality, measured by the probability of choosing an optimal feasible arm over an infinite horizon, we show that optimality is achievable by a simple strategy extending the \(\epsilon_t\)-greedy strategy used for unconstrained MAB problems. We provide a finite-time lower bound on the probability of correctly selecting an optimal near-feasible arm that holds for all time steps; under some conditions, the bound approaches one as time t goes to infinity. A particular example sequence \(\{\epsilon_t\}\) whose asymptotic convergence rate is of order \((1-\frac{1}{t})^4\) from a sufficiently large t onward is also discussed.
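To make the setting concrete, the following is a minimal sketch of an \(\epsilon_t\)-greedy strategy for a CMAB instance, in the outline the abstract describes: explore uniformly with probability \(\epsilon_t\), otherwise play the empirically best arm among the empirically feasible ones. The Bernoulli reward/cost model, the feasibility threshold `budget`, and the schedule `eps_t = min(1, 5k/t)` are illustrative assumptions, not the paper's exact construction or its \(\{\epsilon_t\}\) sequence.

```python
import random

def epsilon_t_greedy_cmab(reward_means, cost_means, budget, horizon, seed=0):
    """Illustrative epsilon_t-greedy play for a constrained MAB.

    Each arm i yields a Bernoulli reward with mean reward_means[i] and a
    Bernoulli cost with mean cost_means[i]; an arm is feasible when its
    (unknown) mean cost is at most `budget`.  This is a sketch of the
    strategy family the paper analyzes, not its exact algorithm: the
    exploration schedule eps_t = min(1, 5k/t) is an assumed example.
    Returns the index of the empirically best feasible arm at the end.
    """
    rng = random.Random(seed)
    k = len(reward_means)
    pulls = [0] * k
    reward_sums = [0.0] * k
    cost_sums = [0.0] * k

    def pull(i):
        pulls[i] += 1
        reward_sums[i] += 1.0 if rng.random() < reward_means[i] else 0.0
        cost_sums[i] += 1.0 if rng.random() < cost_means[i] else 0.0

    for i in range(k):                    # play each arm once to initialize
        pull(i)

    for t in range(k + 1, horizon + 1):
        eps_t = min(1.0, 5.0 * k / t)     # assumed exploration schedule
        if rng.random() < eps_t:
            pull(rng.randrange(k))        # explore: uniform random arm
        else:
            # exploit: best empirical reward among empirically feasible arms
            feasible = [i for i in range(k)
                        if cost_sums[i] / pulls[i] <= budget]
            candidates = feasible or list(range(k))
            pull(max(candidates, key=lambda i: reward_sums[i] / pulls[i]))

    # report the empirically best arm among those that look feasible
    return max(range(k),
               key=lambda i: (reward_sums[i] / pulls[i]
                              if cost_sums[i] / pulls[i] <= budget else -1.0))
```

In a typical run with three arms where the highest-reward arm is infeasible (e.g. rewards 0.3/0.6/0.9, costs 0.2/0.3/0.9, budget 0.5), the strategy settles on the middle arm: the best arm that satisfies the constraint, which is the notion of an "optimal feasible arm" in the abstract.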

Updated: 2020-01-02