Dominate or Delete: Decentralized Competing Bandits with Uniform Valuation
arXiv - CS - Distributed, Parallel, and Cluster Computing Pub Date : 2020-06-26 , DOI: arxiv-2006.15166
Abishek Sankararaman, Soumya Basu, Karthik Abinav Sankararaman

We study regret minimization in a two-sided matching market where uniformly valued demand-side agents (a.k.a. agents) continuously compete to be matched with supply-side agents (a.k.a. arms) whose valuations are unknown and heterogeneous. Such markets abstract online matching platforms (e.g., UpWork, TaskRabbit) and fall within the purview of the matching bandit models introduced by Liu et al. \cite{matching_bandits}. Uniform valuation on the demand side admits a unique stable matching equilibrium in the system. We design the first decentralized algorithm, \fullname\ (\name), for matching bandits under uniform valuation that requires no knowledge of reward gaps or of the time horizon, thus partially resolving an open question in \cite{matching_bandits}. \name\ works in phases of exponentially increasing length. In each phase $i$, an agent first deletes dominated arms -- the arms preferred by agents ranked higher than itself. Deletion is followed by dynamic explore-exploit using the UCB algorithm on the remaining arms for $2^i$ rounds. Finally, the preferred arm is broadcast in a decentralized fashion to the other agents through {\em pure exploitation} over $(N-1)K$ rounds, where $N$ is the number of agents and $K$ the number of arms. Comparing the obtained reward with that of the unique stable matching, we show that \name\ achieves $O(\log(T)/\Delta^2)$ regret in $T$ rounds, where $\Delta$ is the minimum gap across all agents and arms. We also provide an (orderwise) matching regret lower bound.
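The phased structure described in the abstract -- delete dominated arms, then run UCB on the survivors for $2^i$ rounds -- can be sketched from a single agent's point of view. This is an illustrative simulation, not the paper's algorithm: the function names (`ucb_phase`, `run_agent`), the Bernoulli reward model, the fixed dominated set, and the omission of the $(N-1)K$-round broadcast step are all assumptions made for the sketch.

```python
import math
import random

def ucb_phase(means, active_arms, horizon, rng):
    """Run UCB1 on the active (non-dominated) arms for `horizon` rounds.

    Rewards are simulated Bernoulli draws -- a stand-in for the unknown,
    heterogeneous arm valuations in the paper.  Returns the empirically
    best arm and the total reward collected in the phase.
    """
    counts = {a: 0 for a in active_arms}
    sums = {a: 0.0 for a in active_arms}
    total = 0.0
    for t in range(1, horizon + 1):
        untried = [a for a in active_arms if counts[a] == 0]
        if untried:
            # Play each arm once before using confidence bounds.
            arm = untried[0]
        else:
            arm = max(active_arms,
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    best = max(active_arms, key=lambda a: sums[a] / counts[a])
    return best, total

def run_agent(means, dominated, num_phases, seed=0):
    """Phased schedule for one agent: in phase i, delete the dominated
    arms, then explore-exploit with UCB for 2**i rounds.  The decentralized
    broadcast / pure-exploitation step of the real algorithm is elided.
    """
    rng = random.Random(seed)
    arms = range(len(means))
    preferred = None
    for i in range(1, num_phases + 1):
        active = [a for a in arms if a not in dominated]
        preferred, _ = ucb_phase(means, active, 2 ** i, rng)
    return preferred
```

With hypothetical means `[0.9, 0.7, 0.1]` and arm 0 dominated (claimed by a higher-ranked agent), the agent settles on arm 1 once the phase lengths are long enough for the UCB estimates to separate the remaining arms.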

Updated: 2020-06-30