Multi-objective multi-armed bandit with lexicographically ordered and satisficing objectives
Machine Learning ( IF 4.3 ) Pub Date : 2021-05-05 , DOI: 10.1007/s10994-021-05956-1
Alihan Hüyük , Cem Tekin

We consider the multi-objective multi-armed bandit problem with (i) lexicographically ordered and (ii) satisficing objectives. In the first problem, the goal is to select lexicographically optimal arms as often as possible without knowing the arm reward distributions beforehand. We capture this goal by defining a multi-dimensional form of regret that measures the loss due to not selecting lexicographically optimal arms, and then propose an algorithm that achieves \({\tilde{O}}(T^{2/3})\) gap-free regret and prove a regret lower bound of \(\varOmega (T^{2/3})\). We also consider two additional settings in which the learner has prior information on the expected arm rewards. In the first setting, the learner knows, for each objective, only the lexicographically optimal expected reward; in the second setting, it knows only a near-lexicographically-optimal expected reward for each objective. For both settings, we prove that the learner achieves expected regret that is uniformly bounded in time. We then show that the algorithm we propose for the second setting of lexicographically ordered objectives with prior information also attains bounded regret for satisficing objectives. Finally, we experimentally evaluate the proposed algorithms on a variety of multi-objective learning problems.




Updated: 2021-05-06