A Cost-Based Analysis for Risk-Averse Explore-Then-Commit Finite-Time Bandits
IISE Transactions (IF 2.0) Pub Date: 2021-02-02
Ali Yekkehkhany, Ebrahim Arian, Rakesh Nagi, Ilan Shomorony

ABSTRACT

In this paper, a multi-armed bandit problem is studied in an explore-then-commit setting where the cost of pulling an arm in the experimentation (exploration) phase may not be negligible. The goal is to identify the best arm after a pure experimentation phase and to exploit it once or for a given finite number of times. Applications of this setting are prevalent in personalized healthcare and financial investments, where the frequency of exploitation is limited. In this setting, we observe that pulling the arm with the highest expected reward is not necessarily the most desirable objective for exploitation. Instead, we advocate the idea of risk aversion, where the objective is to compete against the arm with the best risk-return trade-off. Additionally, a trade-off between cost and regret should be considered when pulling arms in the exploration phase incurs a cost. In the case that the exploration cost is not considered, we propose a class of hyper-parameter-free risk-averse algorithms, called OTE/FTE-MAB (One/Finite-Time Exploitation Multi-Armed Bandit), whose objective is to select the arm that is most likely to yield the highest reward in a single or finite number of exploitations. To analyze these algorithms, we define a new notion of finite-time exploitation regret for our setting of interest. We provide an upper bound of order ln(1/ϵ_r) on the minimum number of experiments needed to guarantee an upper bound of ϵ_r on the regret. Compared with existing risk-averse bandit algorithms, our algorithms do not rely on hyper-parameters, resulting in more robust behavior in practice. In the case that pulling an arm in the exploration phase has a cost, we propose the c-OTE-MAB algorithm for two-armed bandits, which addresses the cost-regret trade-off (corresponding to the exploration-exploitation trade-off) by using a hyper-parameter to minimize a linear combination of cost and regret, called the cost-regret function. This algorithm determines an estimate of the optimal number of explorations whose cost-regret value approaches the minimum value of the cost-regret function at rate 1/√(n_e) with an associated confidence level, where n_e is the number of explorations of each arm.
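To make the one-time-exploitation objective concrete, here is a minimal Python sketch of the explore-then-commit idea the abstract describes. It is an illustration under our own assumptions, not the paper's OTE-MAB pseudocode: the Monte Carlo resampling rule, the function names, and the toy arms are all hypothetical stand-ins for the selection criterion "pick the arm most likely to yield the highest reward in a single pull."

    import numpy as np

    rng = np.random.default_rng(seed=0)

    def explore(arms, n_e):
        # Exploration phase: pull every arm n_e times and keep the samples.
        return [np.array([pull() for _ in range(n_e)]) for pull in arms]

    def commit_one_time(samples, n_mc=100_000):
        # Estimate, by resampling from the empirical exploration samples,
        # the probability that each arm gives the single highest reward in
        # one future pull, then commit to the arm maximizing that probability.
        k = len(samples)
        draws = np.stack([rng.choice(s, size=n_mc, replace=True) for s in samples])
        winners = np.argmax(draws, axis=0)              # winner of each simulated pull
        win_prob = np.bincount(winners, minlength=k) / n_mc
        return int(np.argmax(win_prob)), win_prob

    # Toy two-armed example: arm 0 has the higher mean (2.0) but pays off
    # rarely, so a single exploitation of arm 1 (a sure 1.0) wins ~98% of the time.
    arms = [lambda: 100.0 if rng.random() < 0.02 else 0.0,
            lambda: 1.0]
    arm, win_prob = commit_one_time(explore(arms, n_e=1000))
    print(f"commit to arm {arm}; estimated win probabilities: {win_prob}")

Note that arm 0 maximizes expected reward, yet the risk-averse criterion commits to arm 1; this is exactly the distinction the abstract draws between pulling the arm with the highest expected reward and competing against the arm with the best risk-return trade-off when exploitation happens only once.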



