DUELING BANDIT PROBLEMS
Probability in the Engineering and Informational Sciences (IF 0.7), Pub Date: 2020-11-20, DOI: 10.1017/s0269964820000601
Erol Peköz, Sheldon M. Ross, Zhengyu Zhang

There is a set of $n$ bandits, and at every stage two of the bandits are chosen to play a game, with the result of the game being learned. In the “weak regret problem,” we suppose there is a “best” bandit that wins each game it plays with probability at least $p > 1/2$, with the value of $p$ being unknown. The objective is to choose bandits so as to maximize the number of times that one of the competitors is the best bandit. In the “strong regret problem,” we suppose that bandit $i$ has unknown value $v_i$, $i = 1, \ldots, n$, and that $i$ beats $j$ with probability $v_i/(v_i + v_j)$. One version of strong regret seeks to maximize the number of times that the contest is between the players with the two largest values. Another version supposes that at any stage, rather than choosing two arms to play a game, the decision maker can declare that a particular arm is the best, with the objective of maximizing the number of stages in which the arm with the largest value is declared to be the best. In the weak regret problem, we propose a policy and obtain an analytic bound on the expected number of stages, over an infinite time frame, in which the best arm is not one of the competitors when this policy is employed. In the strong regret problem, we propose a Thompson sampling type algorithm and empirically compare its performance with others in the literature.
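
To make the setup concrete, the following is a minimal Python sketch of the game model and the two regret counts described in the abstract. It is not the authors' policy or their Thompson sampling type algorithm. The function names, the example values, and the uniformly random pairing rule are all assumptions introduced here for illustration, and the value-based win probability is applied throughout purely for simplicity.

```python
import random


def win_probability(values, i, j):
    """Probability that bandit i beats bandit j: v_i / (v_i + v_j)."""
    return values[i] / (values[i] + values[j])


def play_game(values, i, j, rng):
    """Simulate one game between bandits i and j; return the winner's index."""
    return i if rng.random() < win_probability(values, i, j) else j


def weak_regret(pairs, best):
    """Number of stages in which the best bandit is not one of the competitors."""
    return sum(1 for i, j in pairs if best not in (i, j))


def strong_regret(pairs, top_two):
    """Number of stages in which the contest is not between the top two bandits."""
    target = set(top_two)
    return sum(1 for i, j in pairs if {i, j} != target)


# Hypothetical example values (not taken from the paper).
values = [1.0, 2.0, 4.0]
rng = random.Random(0)

# Placeholder policy (an assumption): pick two distinct bandits uniformly at random.
pairs = [tuple(rng.sample(range(len(values)), 2)) for _ in range(1000)]
winners = [play_game(values, i, j, rng) for i, j in pairs]

print("weak regret:", weak_regret(pairs, best=2))
print("strong regret:", strong_regret(pairs, top_two=(1, 2)))
```

Because the random pairing rule ignores the observed winners, both counts grow linearly with the number of stages. A policy of the kind the abstract describes would instead use the game results to concentrate play on the strongest bandits.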

Updated: 2020-11-20