The space complexity of inner product filters
arXiv - CS - Databases. Pub Date: 2019-09-24. DOI: arxiv-1909.10766
Rasmus Pagh, Johan Sivertsen

Motivated by the problem of filtering candidate pairs in inner product similarity joins, we study the following inner product estimation problem: Given parameters $d\in {\bf N}$, $\alpha>\beta\geq 0$ and unit vectors $x,y\in {\bf R}^{d}$, consider the task of distinguishing between the cases $\langle x, y\rangle\leq\beta$ and $\langle x, y\rangle\geq \alpha$, where $\langle x, y\rangle = \sum_{i=1}^d x_i y_i$ is the inner product of vectors $x$ and $y$. The goal is to distinguish these cases based on information about each vector, encoded independently in a bit string of the shortest possible length. In contrast to much work on compressing vectors using randomized dimensionality reduction, we seek to solve the problem deterministically, with no probability of error. Inner product estimation can be solved in general by estimating $\langle x, y\rangle$ with an additive error bounded by $\varepsilon = \alpha - \beta$. We show that $d \log_2 \left(\tfrac{\sqrt{1-\beta}}{\varepsilon}\right) \pm \Theta(d)$ bits of information about each vector are necessary and sufficient. Our upper bound is constructive and improves a known upper bound of $d \log_2(1/\varepsilon) + O(d)$ by up to a factor of 2 when $\beta$ is close to $1$. The lower bound holds even in a stronger model where one of the vectors is known exactly, and an arbitrary estimation function is allowed.
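
To make the filtering task concrete, here is a minimal Python sketch. It is not the paper's construction, and the helper names `quantize` and `filter_pair` are illustrative: each coordinate of a unit vector is rounded to a naive uniform grid, which guarantees additive inner-product error below $\varepsilon = \alpha - \beta$ and lets a midpoint threshold separate the two cases, but it spends roughly $d \log_2(\sqrt{d}/\varepsilon)$ bits per vector, more than the paper's $d \log_2\left(\tfrac{\sqrt{1-\beta}}{\varepsilon}\right) \pm \Theta(d)$ bound.

```python
import math

def quantize(v, eps):
    """Deterministically round each coordinate of a unit vector v to a
    uniform grid of step eps / (3 * sqrt(d)).  The per-coordinate rounding
    error is at most eps / (6 * sqrt(d)), so by Cauchy-Schwarz the
    inner-product error between two quantized unit vectors is at most
    eps/3 + eps^2/36 < eps/2 (for eps <= 1).  Naive baseline only: it
    stores about log2(6 * sqrt(d) / eps) bits per coordinate, more than
    the paper's bound."""
    step = eps / (3 * math.sqrt(len(v)))
    return [round(x / step) * step for x in v]

def filter_pair(qx, qy, alpha, beta):
    """Given quantized vectors whose inner-product estimate has additive
    error below (alpha - beta) / 2, distinguish <x,y> <= beta from
    <x,y> >= alpha by thresholding at the midpoint (alpha + beta) / 2."""
    est = sum(a * b for a, b in zip(qx, qy))
    return est >= (alpha + beta) / 2  # True: report <x,y> >= alpha
```

Because the estimation error is strictly below $\varepsilon/2$, a pair with $\langle x, y\rangle\leq\beta$ yields an estimate strictly below the midpoint and a pair with $\langle x, y\rangle\geq\alpha$ strictly above it, so the threshold test never errs; the paper's contribution is achieving this guarantee with the shorter encodings stated above.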

Updated: 2020-01-14