Cost-aware Cascading Bandits
IEEE Transactions on Signal Processing (IF 5.4), Pub Date: 2020-01-01, DOI: 10.1109/tsp.2020.3001388
Chao Gan, Ruida Zhou, Jing Yang, Cong Shen

In this paper, we propose a cost-aware cascading bandits model, a new variant of multi-armed bandits with cascading feedback, by considering the random cost of pulling arms. In each step, the learning agent chooses an ordered list of items and examines them sequentially until a certain stopping condition is satisfied. Our objective is to maximize the expected net reward in each step, i.e., the reward obtained in that step minus the total cost incurred in examining the items, by deciding both the ordered list of items and when to stop examination. We first consider the setting where the instantaneous cost of pulling an arm is unknown to the learner until the arm has been pulled. We study both the offline and online settings, which differ in whether the state and cost statistics of the items are known beforehand. For the offline setting, we show that the Unit Cost Ranking with Threshold 1 (UCR-T1) policy is optimal. For the online setting, we propose a Cost-aware Cascading Upper Confidence Bound (CC-UCB) algorithm and show that its cumulative regret scales as $O(\log T)$. We also provide a lower bound for all $\alpha$-consistent policies, which scales as $\Omega(\log T)$ and matches our upper bound. We then investigate the setting where the instantaneous cost of pulling each arm is available to the learner for its decision-making, and show that a slight modification of the CC-UCB algorithm, termed CC-UCB2, is order-optimal. The performance of the algorithms is evaluated with both synthetic and real-world data.
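
The abstract names the UCR-T1 policy without defining it. The sketch below illustrates one plausible reading, which is an assumption and not taken from the text above: each item i has a Bernoulli state with mean theta[i] and a strictly positive mean cost cost[i], the agent stops at the first item whose state is 1 (earning reward 1), and the policy keeps only items with theta[i] / cost[i] > 1, ranked by that ratio in descending order. The function names and the numbers are hypothetical. Under these assumptions, a list's expected net reward is the sum over positions of the probability of reaching that position times (theta - cost), and a brute-force search over all ordered lists can be used to check the ranking on a small instance.

```python
from itertools import permutations


def expected_net_reward(order, theta, cost):
    """Expected net reward of examining items in `order`.

    Assumes (not stated in the abstract) that examination stops at the
    first item whose state is 1 and that the reward is then 1.
    """
    total = 0.0
    reach = 1.0  # probability that the agent gets as far as this item
    for i in order:
        total += reach * (theta[i] - cost[i])
        reach *= 1.0 - theta[i]
    return total


def ucr_t1(theta, cost):
    """Hypothetical UCR-T1: keep items with theta/cost > 1, best ratio first.

    Assumes every cost is strictly positive.
    """
    kept = [i for i in range(len(theta)) if theta[i] / cost[i] > 1.0]
    return sorted(kept, key=lambda i: theta[i] / cost[i], reverse=True)


def brute_force_best(theta, cost):
    """Best expected net reward over all ordered lists (empty list gives 0)."""
    n = len(theta)
    best = 0.0
    for r in range(1, n + 1):
        for order in permutations(range(n), r):
            best = max(best, expected_net_reward(order, theta, cost))
    return best


# Illustrative statistics only; they do not come from the paper.
theta = [0.6, 0.3, 0.5, 0.2]
cost = [0.2, 0.4, 0.1, 0.15]

order = ucr_t1(theta, cost)
print("ranked list:", order)
print("expected net reward:", expected_net_reward(order, theta, cost))
print("brute-force optimum:", brute_force_best(theta, cost))
```

On this toy instance the ranked list drops the one item whose expected cost exceeds its expected reward, and its value can be compared directly against the exhaustive search. The online CC-UCB algorithm is not sketched here because the abstract gives no detail on its confidence bounds.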

Updated: 2020-01-01