PM-LSH: a fast and accurate in-memory framework for high-dimensional approximate NN and closest pair search
The VLDB Journal (IF 2.8), Pub Date: 2021-07-03, DOI: 10.1007/s00778-021-00680-7
Bolong Zheng 1, Xi Zhao 1, Lianggui Weng 1, Quoc Viet Hung Nguyen 2, Hang Liu 3, Christian S. Jensen 4

Nearest neighbor (NN) search is inherently computationally expensive in high-dimensional spaces due to the curse of dimensionality. As a well-known solution, locality-sensitive hashing (LSH) is able to answer c-approximate NN (c-ANN) queries in sublinear time with constant probability. Existing LSH methods focus mainly on building hash bucket-based indexing such that the candidate points can be retrieved quickly. However, existing coarse-grained structures fail to offer accurate distance estimation for candidate points, which translates into additional computational overhead when having to examine unnecessary points. This in turn reduces the performance of query processing. In contrast, we propose a fast and accurate in-memory LSH framework, called PM-LSH, that aims to compute the c-ANN query on large-scale, high-dimensional datasets. First, we adopt a simple yet effective PM-tree to index the data points. Second, we develop a tunable confidence interval to achieve accurate distance estimation and guarantee high result quality. Third, we propose an efficient algorithm on top of the PM-tree to improve the performance of computing c-ANN queries. In addition, we extend PM-LSH to support closest pair (CP) search in high-dimensional spaces. Here, we again adopt the PM-tree to organize the points in a low-dimensional space, and we propose a branch and bound algorithm together with a radius pruning technique to improve the performance of computing c-approximate closest pair (c-ACP) queries. Extensive experiments with real-world data offer evidence that PM-LSH is capable of outperforming existing proposals with respect to both efficiency and accuracy for both NN and CP search.
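The general idea described above (hash points into a low-dimensional space, use distances there to pick candidates, then verify candidates in the original space) can be illustrated with a minimal sketch. This is not the PM-LSH algorithm: it assumes Gaussian random projections as the hash functions, replaces the PM-tree with a brute-force scan of the projected points, and replaces the tunable confidence interval with a fixed candidate budget. The names `build_projection`, `ann_query`, and `num_candidates` are hypothetical and chosen only for illustration.

```python
import numpy as np


def build_projection(dim, m, seed=0):
    """Return a dim x m Gaussian matrix; each column acts as one hash function.

    Assumption: 2-stable (Gaussian) projections, a common choice for
    Euclidean LSH. The abstract does not specify the hash family.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((dim, m))


def ann_query(data, projected, proj, q, k=1, num_candidates=50):
    """Approximate k-NN: filter by projected distance, verify by true distance.

    A real index (e.g. a metric tree) would replace the linear scan over
    `projected`; the scan here only keeps the sketch short.
    """
    q_proj = q @ proj
    # Distances in the low-dimensional projected space (cheap estimates).
    proj_dist = np.linalg.norm(projected - q_proj, axis=1)
    n_cand = min(num_candidates, len(data))
    # Indices of the n_cand points closest to the query in projected space.
    cand = np.argpartition(proj_dist, n_cand - 1)[:n_cand]
    # Verify the candidates with exact distances in the original space.
    true_dist = np.linalg.norm(data[cand] - q, axis=1)
    order = np.argsort(true_dist)[:k]
    return cand[order], true_dist[order]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.standard_normal((1000, 128))
    proj = build_projection(128, 15)
    projected = data @ proj
    idx, dist = ann_query(data, projected, proj, data[0] + 0.01, k=5)
    print(idx, dist)
```

A larger `num_candidates` trades query time for accuracy, which is the same trade-off the abstract attributes to the quality of the distance estimate: the better the projected distance predicts the true distance, the fewer unnecessary points need to be examined.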




Updated: 2021-07-04