Sampling from a $k$-DPP without looking at all items
arXiv - CS - Machine Learning Pub Date : 2020-06-30 , DOI: arxiv-2006.16947
Daniele Calandriello, Michał Dereziński, Michal Valko

Determinantal point processes (DPPs) are a useful probabilistic model for selecting a small diverse subset out of a large collection of items, with applications in summarization, stochastic optimization, active learning and more. Given a kernel function and a subset size $k$, our goal is to sample $k$ out of $n$ items with probability proportional to the determinant of the kernel matrix induced by the subset (a.k.a. $k$-DPP). Existing $k$-DPP sampling algorithms require an expensive preprocessing step that involves multiple passes over all $n$ items, making them infeasible for large datasets. A naive heuristic addressing this problem is to uniformly subsample a fraction of the data and perform $k$-DPP sampling only on those items. However, this method offers no guarantee that the produced sample will even approximately resemble the target distribution over the original dataset. In this paper, we develop an algorithm that adaptively builds a sufficiently large uniform sample of the data, which is then used to efficiently generate a smaller set of $k$ items, while ensuring that this set is drawn exactly from the target distribution defined on all $n$ items. We show empirically that our algorithm produces a $k$-DPP sample after observing only a small fraction of all elements, leading to several orders of magnitude faster performance compared to the state-of-the-art.
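
To make the target distribution concrete, the following is a minimal brute-force sketch of the $k$-DPP definition given above: each size-$k$ subset is weighted by the determinant of its kernel submatrix. It is not the algorithm developed in the paper; it enumerates every subset and touches all $n$ items, so it is only usable for tiny $n$. The Gaussian kernel, the bandwidth value, and all function names (`rbf_kernel`, `det`, `kdpp_distribution`, `sample_kdpp`) are assumptions chosen for illustration.

```python
import itertools
import math
import random


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) similarity between two points given as tuples.

    The kernel choice is an assumption; the abstract only says
    "a kernel function".
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # numerically singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result  # a row swap flips the sign
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result


def kdpp_distribution(items, k, kernel=rbf_kernel):
    """Return {index subset: probability} for the exact k-DPP over items.

    Enumerates all C(n, k) subsets, so this is feasible only for tiny n.
    """
    weights = {}
    for subset in itertools.combinations(range(len(items)), k):
        sub = [[kernel(items[i], items[j]) for j in subset] for i in subset]
        # Clamp tiny negative values caused by floating-point round-off.
        weights[subset] = max(det(sub), 0.0)
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def sample_kdpp(items, k, rng=random):
    """Draw one size-k index subset from the brute-force k-DPP."""
    dist = kdpp_distribution(items, k)
    subsets = list(dist)
    return rng.choices(subsets, weights=[dist[s] for s in subsets], k=1)[0]


if __name__ == "__main__":
    # Two near-duplicate points and one distant point.
    points = [(0.0,), (0.1,), (5.0,)]
    for subset, prob in kdpp_distribution(points, 2).items():
        print(subset, round(prob, 4))
    print("sample:", sample_kdpp(points, 2))
```

In this toy example, the pair of near-duplicate points receives a very small probability, while pairs that include the distant point share almost all of the mass, which is the diversity-promoting behaviour the abstract describes.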

Updated: 2020-07-01