Perfect $L_p$ Sampling in a Data Stream, SIAM Journal on Computing



Perfect $L_p$ Sampling in a Data Stream
SIAM Journal on Computing (IF 1.2), Pub Date: 2021-03-30, DOI: 10.1137/18m1229912
Rajesh Jayaram, David Woodruff

SIAM Journal on Computing, Volume 50, Issue 2, Pages 382-439, January 2021.
In this paper, we resolve the one-pass space complexity of perfect $L_p$ sampling for $p \in (0,2)$ in a stream. Given a stream of updates (insertions and deletions) to the coordinates of an underlying vector $f \in \mathbb{R}^n$, a perfect $L_p$ sampler must output an index $i$ with probability $|f_i|^p/\|f\|_p^p$ and is allowed to fail with some probability $\delta$. So far, for $p > 0$, no algorithm has been shown to solve the problem exactly using $\mathrm{poly}(\log n)$ bits of space. In 2010, Monemizadeh and Woodruff introduced an approximate $L_p$ sampler which, given an approximation parameter $\nu$, outputs $i$ with probability $(1 \pm \nu)|f_i|^p /\|f\|_p^p$, using space polynomial in $\nu^{-1}$ and $\log(n)$. The space complexity was later reduced by Jowhari, Sağlam, and Tardos to roughly $O(\nu^{-p} \log^2 n \log \delta^{-1})$ for $p \in (0,2)$, which matches the general $p\geq 0$ lower bound of $\Omega(\log^2 n \log \delta^{-1})$ in terms of $n$ and $\delta$, but is loose in terms of $\nu$. Given these nearly tight bounds, it is perhaps surprising that no lower bound exists in terms of $\nu$; not even a bound of $\Omega(\nu^{-1})$ is known. In this paper, we explain this phenomenon by demonstrating the existence of an $O(\log^2 n \log \delta^{-1})$-bit perfect $L_p$ sampler for $p \in (0,2)$. This shows that $\nu$ need not factor into the space of an $L_p$ sampler, which closes the complexity of the problem for this range of $p$. For $p=2$, our bound is $O(\log^3 n \log \delta^{-1})$ bits, which matches the prior best known upper bound of $O(\nu^{-2}\log^3 n \log \delta^{-1})$, but has no dependence on $\nu$. Note that there is still a $\log n$ gap between our upper bound and the lower bound for $p=2$, the resolution of which we leave as an open problem. For $p<2$, our bound holds in the random oracle model, matching the lower bounds in that model. However, we show that our algorithm can be derandomized with only an $O((\log \log n)^2)$ blow-up in the space (and no blow-up for $p=2$). Our derandomization technique is quite general, and can be used to derandomize a large class of linear sketches, including the more accurate count-sketch variant of Minton and Price [Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, Philadelphia, 2014, pp. 669-686], resolving an open question in that paper. Finally, we show that a $(1\pm\epsilon)$ relative error estimate of the frequency $f_i$ of the sampled index $i$ can be obtained using an additional $O(\epsilon^{-p} \log n)$ bits of space for $p < 2$, and $O(\epsilon^{-2} \log^2 n)$ bits for $p=2$, which was possible before only by running the prior algorithms with $\nu = \epsilon$.
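To make the target distribution concrete, here is a minimal offline sketch in Python. It is not the streaming algorithm from the paper: it stores $f$ explicitly, so it uses $O(n)$ space rather than a polylogarithmic sketch. The function names (`apply_updates`, `lp_distribution`, `sample_by_exponential_scaling`) are hypothetical and chosen for illustration only. The sampling routine relies on a standard property of exponential random variables: if $t_i \sim \mathrm{Exp}(1)$ independently, then $t_i/|f_i|^p$ is exponential with rate $|f_i|^p$, so the index minimizing it is $i$ with probability exactly $|f_i|^p/\|f\|_p^p$.

```python
import random


def apply_updates(n, updates):
    """Materialize f in R^n from a stream of (index, delta) updates.

    Deltas may be positive (insertions) or negative (deletions).
    """
    f = [0.0] * n
    for i, delta in updates:
        f[i] += delta
    return f


def lp_distribution(f, p):
    """Return the perfect L_p sampling distribution |f_i|^p / ||f||_p^p."""
    weights = [abs(x) ** p for x in f]
    total = sum(weights)
    if total == 0:
        raise ValueError("f is the zero vector; the distribution is undefined")
    return [w / total for w in weights]


def sample_by_exponential_scaling(f, p, rng):
    """Draw one index distributed exactly as lp_distribution(f, p).

    Assumption for illustration: f is held in memory. Each nonzero
    coordinate gets an independent t_i ~ Exp(1); the index minimizing
    t_i / |f_i|^p (equivalently, maximizing |f_i| / t_i^(1/p)) is returned.
    Returns None if f is the zero vector.
    """
    best_index, best_key = None, float("inf")
    for i, x in enumerate(f):
        if x == 0:
            continue
        key = rng.expovariate(1.0) / abs(x) ** p
        if key < best_key:
            best_index, best_key = i, key
    return best_index


if __name__ == "__main__":
    # Insertions and deletions leave f = [2, 1, -2].
    stream = [(0, 3), (1, 1), (0, -1), (2, -2)]
    f = apply_updates(3, stream)
    print(lp_distribution(f, p=1))  # [0.4, 0.2, 0.4]
    print(sample_by_exponential_scaling(f, p=1, rng=random.Random(0)))
```

An approximate sampler in the sense of the abstract would be allowed to distort each of these probabilities by a factor of $(1 \pm \nu)$; a perfect sampler must match them exactly, up to the failure probability $\delta$.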


Updated: 2021-06-01
