Risk-Aware Multi-Armed Bandits With Refined Upper Confidence Bounds
IEEE Signal Processing Letters (IF 3.9) Pub Date: 2020-12-28, DOI: 10.1109/lsp.2020.3047725
Xingchi Liu, Mahsa Derakhshani, Sangarapillai Lambotharan, Mihaela van der Schaar

The classical multi-armed bandit (MAB) framework studies the exploration-exploitation dilemma in decision making and always treats the arm with the highest expected reward as the optimal choice. However, in some applications, an arm with a high expected reward can be risky to play if its variance is high. Hence, the variation of the reward should be taken into account to make the arm-selection process risk-aware. In this letter, the mean-variance metric is investigated to measure the uncertainty of the received rewards. We first study a risk-aware MAB problem in which the rewards follow a Gaussian distribution, and develop a concentration inequality on the variance to design a Gaussian risk-aware upper confidence bound algorithm. Furthermore, we extend this algorithm to a novel asymptotic risk-aware upper confidence bound algorithm by deriving an upper confidence bound on the variance from the asymptotic distribution of the sample variance. Theoretical analysis proves that both proposed algorithms achieve $\mathcal{O}(\log(T))$ regret. Finally, numerical results demonstrate that our algorithms outperform several risk-aware MAB algorithms.
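The abstract does not reproduce the paper's confidence bounds, so the following is only a minimal Python sketch of the generic mean-variance UCB idea it describes: score each arm by its empirical mean-variance (sample variance minus a risk-tolerance weight times the sample mean, lower is better), optimistically shrink the variance and inflate the mean by a confidence radius, and pull the arm with the best optimistic score. The function name `mean_variance_ucb`, the Hoeffding-style radius `sqrt(2 log t / n)`, and the parameter `rho` are illustrative assumptions, not the paper's Gaussian or asymptotic variance bounds.

```python
import numpy as np

def mean_variance_ucb(means, stds, T, rho=1.0, seed=0):
    """Sketch of a mean-variance UCB-style bandit (assumed generic bounds,
    not the letter's exact Gaussian/asymptotic constructions).

    Arm i is scored by MV_i = sigma_i^2 - rho * mu_i (lower is better);
    the confidence width makes the score optimistic for under-played arms.
    """
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K, dtype=int)
    sums = np.zeros(K)        # running sum of rewards per arm
    sq_sums = np.zeros(K)     # running sum of squared rewards per arm
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1       # play each arm once to initialise estimates
        else:
            mu_hat = sums / counts
            var_hat = sq_sums / counts - mu_hat ** 2
            width = np.sqrt(2.0 * np.log(t) / counts)   # assumed Hoeffding-style radius
            # optimistic (lower) bound on each arm's mean-variance score
            mv_lcb = np.maximum(var_hat - width, 0.0) - rho * (mu_hat + width)
            arm = int(np.argmin(mv_lcb))
        r = rng.normal(means[arm], stds[arm])            # Gaussian rewards, as in the letter
        counts[arm] += 1
        sums[arm] += r
        sq_sums[arm] += r ** 2
    return counts

# Example: arm 1 has a slightly lower mean but far lower variance than arm 0,
# so a risk-aware learner should come to favour it.
print(mean_variance_ucb(means=[1.0, 0.8], stds=[2.0, 0.1], T=5000))
```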

Last updated: 2021-02-12