To update or not to update? Delayed nonparametric bandits with randomized allocation
Stat ( IF 0.7 ) Pub Date : 2021-02-15 , DOI: 10.1002/sta4.366
Sakshi Arya 1 , Yuhong Yang 1

The delayed‐rewards problem in contextual bandits has been of interest in various practical settings. We study randomized allocation strategies and examine how the exploration–exploitation trade‐off is affected by delays in observing the rewards. In randomized strategies, the extent of exploration–exploitation is controlled by a user‐determined exploration probability sequence. In the presence of delayed rewards, one may either use the original exploration sequence, which updates at every time point, or update the sequence only when a new reward is observed, leading to two competing strategies. In this work, we show that although both strategies may lead to strong consistency in allocation, the property holds for a wider scope of situations for the latter. However, for finite‐sample performance, we illustrate that both strategies have their own advantages and disadvantages, depending on the severity of the delay and the underlying reward‐generating mechanisms.
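
To make the difference between the two strategies concrete, the following is a minimal sketch of how the exploration probability could be indexed in each case. It is only an illustration, not the authors' implementation: the sequence `epsilon(n) = min(1, 1/sqrt(n))`, the function names, and the convention that a reward generated at time `t` with delay `d` becomes available at time `t + d` are all assumptions made here. The nonparametric reward estimation and the arm selection itself are omitted.

```python
import math


def epsilon(n):
    """Hypothetical decreasing exploration sequence (an assumption, not from the paper)."""
    return min(1.0, 1.0 / math.sqrt(n))


def exploration_schedule(horizon, delays, update_on_observation):
    """Return the exploration probability used at each time t = 1..horizon.

    delays[t - 1] is the delay of the reward generated at time t; under the
    convention assumed here, that reward is available from time t + delays[t - 1].

    update_on_observation=False: index the sequence by time (epsilon_t).
    update_on_observation=True:  index the sequence by the number of rewards
                                 observed so far, so it only advances when a
                                 new reward arrives.
    """
    schedule = []
    for t in range(1, horizon + 1):
        if update_on_observation:
            # Rewards from earlier rounds that have arrived before time t.
            observed = sum(1 for s in range(1, t) if s + delays[s - 1] < t)
            index = observed + 1
        else:
            index = t
        schedule.append(epsilon(index))
    return schedule


if __name__ == "__main__":
    delays = [3] * 6  # every reward arrives three steps late (illustrative)
    print(exploration_schedule(6, delays, update_on_observation=False))
    print(exploration_schedule(6, delays, update_on_observation=True))
```

Under these assumptions the two schedules coincide when there is no delay. With delays, the observation-indexed schedule decays more slowly, so that strategy keeps exploring longer while rewards are still outstanding.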

Updated: 2021-04-01