A cost-based analysis for risk-averse explore-then-commit finite-time bandits
IISE Transactions (IF 2.6), Pub Date: 2021-04-06, DOI: 10.1080/24725854.2021.1882014
Ali Yekkehkhany, Ebrahim Arian, Rakesh Nagi, Ilan Shomorony

Abstract

In this article, a multi-armed bandit problem is studied in an explore-then-commit setting where the cost of pulling an arm in the experimentation (exploration) phase may not be negligible. The goal is to identify the best arm after a pure experimentation phase and to exploit it once or for a given finite number of times. Applications of this setting are prevalent in personalized health care and financial investments, where the frequency of exploitation is limited. In this setting, we observe that pulling the arm with the highest expected reward is not necessarily the most desirable objective for exploitation. Instead, we advocate a risk-averse viewpoint, where the objective is to compete against the arm with the best risk-return trade-off. Additionally, when pulling arms in the exploration phase incurs a cost, a trade-off between cost and regret should be considered. For the case where the exploration cost is not considered, we propose a class of hyper-parameter-free risk-averse algorithms, called OTE/FTE-MAB (One/Finite-Time Exploitation Multi-Armed Bandit), whose objective is to select the arm that is most probable to reward the most in a single or finite number of exploitations. To analyze these algorithms, we define a new notion of finite-time exploitation regret for our setting of interest. We provide an upper bound of order ln(1/ε_r) on the minimum number of experiments that should be done to guarantee an upper bound ε_r on regret. Compared with existing risk-averse bandit algorithms, our algorithms do not rely on hyper-parameters, resulting in more robust behavior in practice. For the case where pulling an arm in the exploration phase has a cost, we propose the c-OTE-MAB algorithm for two-armed bandits, which addresses the cost-regret trade-off, corresponding to the exploration-exploitation trade-off, by minimizing a linear combination of cost and regret, called the cost-regret function, using a hyper-parameter. This algorithm determines an estimate of the optimal number of explorations whose cost-regret value approaches the minimum value of the cost-regret function at the rate 1/√(n_e) with an associated confidence level, where n_e is the number of explorations of each arm.
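The abstract describes the algorithms only at a high level; the sketch below is a minimal illustration under stated assumptions, not the paper's exact procedure. It shows an explore-then-commit loop for a two-armed bandit in which, after n_e exploration pulls per arm, the committed arm is chosen by a risk-averse rule in the spirit described above (the arm empirically more likely to yield the larger reward in a single draw) rather than by the highest empirical mean. The reward distributions, the pairwise-comparison estimator, and the cost bookkeeping are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def explore_then_commit_risk_averse(pull, n_explore, cost_per_pull=0.0):
    """Illustrative explore-then-commit for a two-armed bandit.

    Pull each arm n_explore times, then commit to the arm that is
    empirically more likely to give the larger reward in one draw
    (a risk-averse criterion), instead of the larger empirical mean.
    """
    x = np.array([pull(0) for _ in range(n_explore)])
    y = np.array([pull(1) for _ in range(n_explore)])

    p0 = np.mean(x[:, None] > y[None, :])  # empirical P(reward_0 > reward_1)
    p1 = np.mean(y[:, None] > x[None, :])  # empirical P(reward_1 > reward_0)
    committed_arm = 0 if p0 >= p1 else 1

    exploration_cost = 2 * n_explore * cost_per_pull
    return committed_arm, exploration_cost

# Example: arm 0 has the higher mean (3.0 vs 1.8) but pays off rarely,
# so the risk-averse rule typically commits to arm 1 for a single exploitation.
arms = [lambda: 10.0 * (rng.random() < 0.3),  # 10 with prob 0.3, else 0
        lambda: 2.0 * (rng.random() < 0.9)]   # 2 with prob 0.9, else 0
arm, cost = explore_then_commit_risk_averse(lambda a: arms[a](), n_explore=200,
                                            cost_per_pull=0.05)
print(arm, cost)  # e.g. "1 20.0"
```

Here cost_per_pull only accumulates the exploration cost; choosing n_explore so as to balance that cost against regret is the role of the cost-regret function that c-OTE-MAB minimizes in the paper.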




Updated: 2021-04-06