Learning-based Support Estimation in Sublinear Time
arXiv - CS - Data Structures and Algorithms. Pub Date: 2021-06-15, DOI: arxiv-2106.08396
Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner

We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems, and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $\pm \varepsilon n$ from a sample of size $O(\log^2(1/\varepsilon) \cdot n/\log n)$, where $n$ is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimate of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to \[ \log(1/\varepsilon) \cdot n^{1-\Theta(1/\log(1/\varepsilon))}. \] We evaluate the proposed algorithms on a collection of data sets, using the neural-network-based estimators from (Hsu et al., ICLR'19) as predictors. Our experiments demonstrate substantial (up to 3x) improvements in estimation accuracy compared to the state-of-the-art algorithm.
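To make the setup concrete, here is a minimal sketch of how a frequency predictor can plug into a classical inverse-probability (Horvitz-Thompson-style) support estimator. This is not the paper's algorithm, only an illustration of the access model: the function name `predicted_freq` is hypothetical, and the demo uses an "oracle" predictor built from the true counts, assuming uniform sampling with replacement.

```python
import random
from collections import Counter

def estimate_support(sample, n, predicted_freq):
    # An element with (predicted) frequency f appears in a uniform sample
    # of size s with probability roughly 1 - (1 - f/n)**s, so each distinct
    # sampled element contributes weight 1 / p_seen to the support estimate.
    s = len(sample)
    estimate = 0.0
    for x in set(sample):
        f = max(1.0, predicted_freq(x))  # clamp predictor output to >= 1
        p_seen = 1.0 - (1.0 - f / n) ** s
        estimate += 1.0 / p_seen
    return estimate

# Synthetic demo: 1,000 distinct elements, each with frequency 100.
random.seed(0)
dataset = [i for i in range(1000) for _ in range(100)]  # n = 100,000
n = len(dataset)
true_freq = Counter(dataset)
sample = random.choices(dataset, k=5000)  # uniform sample with replacement

# "Oracle" predictor: returns the exact frequency of each element.
est = estimate_support(sample, n, lambda x: true_freq[x])
print(round(est))  # close to the true support of 1,000
```

With a perfect predictor, the estimate concentrates around the true support; the paper's contribution is showing how much a merely constant-factor-accurate predictor reduces the sample size needed for such estimates.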

Updated: 2021-06-17