Learning-based Support Estimation in Sublinear Time
arXiv - CS - Data Structures and Algorithms. Pub Date: 2021-06-15, DOI: arxiv-2106.08396
Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner

We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems, and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $\pm \varepsilon n$ from a sample of size $O(\log^2(1/\varepsilon) \cdot n/\log n)$, where $n$ is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimate of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to \[ \log(1/\varepsilon) \cdot n^{1-\Theta(1/\log(1/\varepsilon))}. \] We evaluate the proposed algorithms on a collection of data sets, using the neural-network-based estimators from (Hsu et al., ICLR'19) as predictors. Our experiments demonstrate substantial (up to 3x) improvements in estimation accuracy compared to the state-of-the-art algorithm.
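To make the setup concrete, here is a minimal sketch of how a frequency predictor can plug into a classical inverse-probability (Horvitz-Thompson-style) support estimator. This is not the paper's algorithm, only an illustration of the access model: the function name `predicted_freq` is hypothetical, and the demo uses an "oracle" predictor built from the true counts, assuming uniform sampling with replacement.

```python
import random
from collections import Counter

def estimate_support(sample, n, predicted_freq):
    # An element with (predicted) frequency f appears in a uniform sample
    # of size s with probability roughly 1 - (1 - f/n)**s, so each distinct
    # sampled element contributes weight 1 / p_seen to the support estimate.
    s = len(sample)
    estimate = 0.0
    for x in set(sample):
        f = max(1.0, predicted_freq(x))  # clamp predictor output to >= 1
        p_seen = 1.0 - (1.0 - f / n) ** s
        estimate += 1.0 / p_seen
    return estimate

# Synthetic demo: 1,000 distinct elements, each with frequency 100.
random.seed(0)
dataset = [i for i in range(1000) for _ in range(100)]  # n = 100,000
n = len(dataset)
true_freq = Counter(dataset)
sample = random.choices(dataset, k=5000)  # uniform sample with replacement

# "Oracle" predictor: returns the exact frequency of each element.
est = estimate_support(sample, n, lambda x: true_freq[x])
print(round(est))  # close to the true support of 1,000
```

With a perfect predictor, the estimate concentrates around the true support; the paper's contribution is showing how much a merely constant-factor-accurate predictor reduces the sample size needed for such estimates.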

Updated: 2021-06-17