C-MinHash: Practically Reducing Two Permutations to Just One
arXiv - CS - Data Structures and Algorithms Pub Date : 2021-09-10 , DOI: arxiv-2109.04595
Xiaoyun Li, Ping Li

Traditional minwise hashing (MinHash) requires applying $K$ independent permutations to estimate the Jaccard similarity in massive binary (0/1) data, where $K$ can be, e.g., 1024 or even larger, depending on the application. The recent work on C-MinHash (Li and Li, 2021) has shown, with rigorous proofs, that only two permutations are needed. An initial permutation is applied to break whatever structure might exist in the data, and a second permutation is re-used $K$ times to produce $K$ hashes in a circulant shifting fashion. Li and Li (2021) proved that, perhaps surprisingly, even though the $K$ hashes are correlated, the estimation variance is strictly smaller than the variance of traditional MinHash. They also demonstrated that the initial permutation in C-MinHash is indeed necessary. For ease of theoretical analysis, they used two independent permutations. In this paper, we show that one can in fact use just one permutation. That is, a single permutation is used for both the initial pre-processing step that breaks the structure in the data and the circulant hashing step that generates the $K$ hashes. Although the theoretical analysis becomes very complicated, we are able to explicitly write down the expression for the expectation of the estimator. The new estimator is no longer unbiased, but the bias is extremely small and has essentially no impact on the estimation accuracy (mean squared error). An extensive set of experiments is provided to verify our claim that just one permutation suffices.
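
The one-permutation scheme described in the abstract can be illustrated with a minimal Python sketch. This is an assumption-laden reading of the abstract, not the authors' reference implementation: the function names `c_minhash_one_perm` and `estimate_jaccard`, the direction of the circulant shift, and the convention that the initial step moves nonzero index `i` to position `perm[i]` are all choices made here for illustration.

```python
import random


def c_minhash_one_perm(nonzeros, perm, K):
    """Return K hashes of a binary vector given by its nonzero indices.

    Assumptions (not taken from the paper's text): `perm` is a permutation
    of range(D), `nonzeros` is non-empty, and K <= D.
    """
    D = len(perm)
    # Initial pre-processing: move each nonzero index i to position perm[i].
    permuted = [perm[i] for i in nonzeros]
    hashes = []
    for k in range(1, K + 1):
        # Shifting perm right by k places puts perm[(j - k) % D] at position j.
        # The hash is the smallest shifted value over the nonzero positions.
        hashes.append(min(perm[(j - k) % D] for j in permuted))
    return hashes


def estimate_jaccard(h1, h2):
    """Fraction of hash positions on which two hash lists collide."""
    return sum(a == b for a, b in zip(h1, h2)) / len(h1)


if __name__ == "__main__":
    rng = random.Random(42)
    D, K = 1000, 256
    perm = list(range(D))
    rng.shuffle(perm)

    # Two highly structured (contiguous) sets with true Jaccard 200/600.
    v = set(range(0, 400))
    w = set(range(200, 600))
    hv = c_minhash_one_perm(v, perm, K)
    hw = c_minhash_one_perm(w, perm, K)
    print("estimated Jaccard:", estimate_jaccard(hv, hw))
```

The same list `perm` appears in both steps, which is the only difference from a two-permutation variant, where the pre-processing line would use a second, independent permutation. Each hash costs one pass over the nonzero indices, so the sketch runs in time proportional to $K$ times the number of nonzeros.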

Updated: 2021-09-13