Dominate or Delete: Decentralized Competing Bandits with Uniform Valuation,arXiv - CS - Computer Science and Game Theory


Dominate or Delete: Decentralized Competing Bandits with Uniform Valuation
arXiv - CS - Computer Science and Game Theory, Pub Date: 2020-06-26, DOI: arxiv-2006.15166
Abishek Sankararaman, Soumya Basu, Karthik Abinav Sankararaman

We study regret minimization in a two-sided matching market where uniformly valued demand-side agents (a.k.a. agents) continuously compete to be matched with supply-side agents (a.k.a. arms) that have unknown and heterogeneous valuations. Such markets abstract online matching platforms (e.g., UpWork, TaskRabbit) and fall within the purview of the matching bandit models introduced by Liu et al. \cite{matching_bandits}. The uniform valuation on the demand side admits a unique stable matching equilibrium in the system. We design the first decentralized algorithm for matching bandits under uniform valuation that does not require any knowledge of reward gaps or time horizon, and thus partially resolves an open question in \cite{matching_bandits}. The algorithm works in phases of exponentially increasing length. In each phase $i$, an agent first deletes dominated arms, that is, the arms preferred by agents ranked higher than itself. Deletion is followed by dynamic explore-exploit using the UCB algorithm on the remaining arms for $2^i$ rounds. Finally, the preferred arm is broadcast in a decentralized fashion to the other agents through pure exploitation in $(N-1)K$ rounds, with $N$ agents and $K$ arms. Comparing the obtained reward against that of the unique stable matching, we show that the algorithm achieves $O(\log(T)/\Delta^2)$ regret in $T$ rounds, where $\Delta$ is the minimum gap across all agents and arms. We also provide an order-wise matching regret lower bound.
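The phase structure described in the abstract (delete dominated arms, run UCB for $2^i$ rounds, then share the preferred arm) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' algorithm: the names `stable_matching` and `simulate` are hypothetical, the index is the textbook UCB1 form, a blocked agent is assumed to observe nothing, each agent is assumed to report its most-attempted arm of the phase, and the $(N-1)K$-round decentralized broadcast is replaced by directly sharing the reported arms.

```python
import math
import random


def stable_matching(means):
    """Unique stable matching when every arm ranks the agents identically.

    Agents are indexed by rank (agent 0 is preferred by every arm), so each
    agent in turn takes its best arm among those not yet taken.
    """
    taken, match = set(), []
    for row in means:
        best = max((k for k in range(len(row)) if k not in taken),
                   key=lambda k: row[k])
        taken.add(best)
        match.append(best)
    return match


def simulate(means, num_phases, seed=0):
    """Simplified phased dominate-or-delete simulation (illustrative only).

    means[j][k] is the assumed Bernoulli mean of arm k for agent j.
    Returns the arm each agent reported after the final phase.
    """
    rng = random.Random(seed)
    n_agents, n_arms = len(means), len(means[0])
    pulls = [[0] * n_arms for _ in range(n_agents)]      # successful pulls
    totals = [[0.0] * n_arms for _ in range(n_agents)]   # summed rewards
    reported = [None] * n_agents                         # arm shared last phase
    t = 0                                                # global round counter

    def ucb(j, k):
        # Assumption: standard UCB1 index; untried arms get priority.
        if pulls[j][k] == 0:
            return float("inf")
        mean = totals[j][k] / pulls[j][k]
        return mean + math.sqrt(2.0 * math.log(t) / pulls[j][k])

    for phase in range(1, num_phases + 1):
        # Step 1: each agent deletes arms reported by higher-ranked agents.
        active = []
        for j in range(n_agents):
            dominated = {reported[h] for h in range(j)}
            active.append([k for k in range(n_arms) if k not in dominated])

        # Step 2: UCB on the surviving arms for 2**phase rounds.
        attempts = [[0] * n_arms for _ in range(n_agents)]
        for _ in range(2 ** phase):
            t += 1
            choices = []
            for j in range(n_agents):
                best = max(ucb(j, k) for k in active[j])
                ties = [k for k in active[j] if ucb(j, k) == best]
                choices.append(rng.choice(ties))  # random tie-breaking
            taken = set()
            for j, k in enumerate(choices):       # lower index wins collisions
                attempts[j][k] += 1
                if k in taken:
                    continue                      # blocked: nothing observed
                taken.add(k)
                reward = 1.0 if rng.random() < means[j][k] else 0.0
                pulls[j][k] += 1
                totals[j][k] += reward

        # Step 3 (simplified): share the most-attempted arm of this phase
        # directly, instead of the decentralized (N-1)K-round broadcast.
        reported = [max(range(n_arms), key=lambda k: attempts[j][k])
                    for j in range(n_agents)]
    return reported


if __name__ == "__main__":
    example = [[0.9, 0.5, 0.1], [0.9, 0.5, 0.1]]  # made-up means
    print("stable matching:", stable_matching(example))
    print("reported arms:  ", simulate(example, num_phases=10))
```

With well-separated means and enough phases, the reported arms would be expected to agree with `stable_matching`, since a lower-ranked agent stops contending for an arm once a higher-ranked agent has reported it. The sketch does not track regret or reproduce the communication protocol.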


Updated: 2020-06-30
