Abstract
Approximate nearest neighbor (ANN) search in high-dimensional spaces is fundamental to many applications. Locality-sensitive hashing (LSH) is a well-known methodology for solving the ANN problem. Existing LSH-based ANN solutions typically employ a large number of individual indexes optimized for search efficiency, and updating such indexes can be impractical when processing high-dimensional streaming data. In this paper, we present a novel disk-based LSH index that efficiently supports both searches and updates. The contributions of our work are threefold. First, we store the LSH projections in write-friendly LSM-trees to facilitate efficient updates. Second, we develop a novel scheme to estimate the number of required LSH functions, which effectively reduces disk storage and access costs. Third, we exploit both the collision number and the projection distance to make candidate selection more efficient, improving search performance with theoretical guarantees on result quality. Experiments on four real-world datasets show that our proposal outperforms state-of-the-art schemes.
Notes
Note that Algorithm 1 sets K proportional to the total number of objects, \(|\mathcal {D}|\). The ratio, denoted \(\beta\), controls the false-positive rate during the search.
That is, SSPD values of different data objects are treated as distinct elements of \(A_{s,j}\) even when they are numerically equal. Hence, the size of the multiset \(A_{s,j}\) is precisely the number of objects that collide with the query at least j times (within radius \(R=1\)); a minimal illustrative sketch follows these notes.
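For concreteness, the following minimal Python sketch illustrates the two notes above: choosing \(K=\lceil \beta |\mathcal{D}|\rceil\) and counting, per object, collisions with a query under a family of hash functions. All concrete names and values (beta, the p-stable hash family, the toy data) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the dataset D and a query (hypothetical values).
n, d, m, w = 1000, 16, 32, 4.0
data = rng.normal(size=(n, d))
query = rng.normal(size=d)

# Note 1: K is proportional to |D|; the ratio beta trades off the
# false-positive rate during search (the value 0.01 is arbitrary).
beta = 0.01
K = int(np.ceil(beta * n))

# m p-stable LSH functions h(x) = floor((a . x + b) / w), one common
# choice for Euclidean LSH; the paper's exact family may differ.
A = rng.normal(size=(m, d))
b = rng.uniform(0.0, w, size=m)

def buckets(x):
    return np.floor((A @ x + b) / w).astype(int)

# Note 2: |A_{s,j}| equals the number of objects colliding with the
# query at least j times, counting per-object SSPD values as distinct.
q_buckets = buckets(query)
collisions = np.array([(buckets(o) == q_buckets).sum() for o in data])
j = 8
print(f"K = {K}; objects with >= {j} collisions: {(collisions >= j).sum()}")
```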
Acknowledgements
The authors would like to thank the editor and the anonymous reviewers for their valuable suggestions and comments. This work was funded in part by the Center of Excellence for NEOM Research at KAUST (REI/1/4178-01-01) and by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under award numbers BAS/1/1624-01, REI/1/0018-01-01, REI/1/4216-01-01, REI/1/4437-01-01, and REI/1/4473-01-01.
Ethics declarations
Conflict of interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Appendix 1: Sketch of Proof of Lemma 1
In this appendix, we provide a brief proof sketch of Lemma 1 in Sect. 4.3. Recall that, for arbitrary \(\varvec{o}_1,\varvec{o}_2,\varvec{q}\in \mathbb{R}^d\), Lemma 1 asserts the following:

\(\Pr[\mathbf{E}_1(s_1,R)]\ \ge\ \Pr[\mathbf{E}_1(s_2,R)],\)   (4)

\(\Pr[\mathbf{E}_2(s_1,R)]\ \ge\ \Pr[\mathbf{E}_2(s_2,R)],\)   (5)

\(\Pr[\mathbf{E}_0(s_1,R)]\ \ge\ \Pr[\mathbf{E}_0(s_2,R)],\)   (6)

where \(s_1={\text{dist}}(\varvec{o}_1,\varvec{q})\), \(s_2={\text{dist}}(\varvec{o}_2,\varvec{q})\), and \(s_1 < s_2\).
To prove Lemma 1, we consider two auxiliary functions, \(F(k,m,x)\) and \(G(n,n^\prime ,m,x)\), both derived from the binomial distribution. The following technical lemmata are useful.
Lemma 3
F(k, m, x) is monotonically decreasing with respect to x if \(0\le x\le 1\) and \(k<m\).
Lemma 4
\(G(n,n^\prime ,m,x)\) is monotonically decreasing with respect to x if \(0\le x\le 1\) and \(n\le n^\prime <m\).
Lemmata 3 and 4 state monotonicity properties of the binomial distribution. Their proofs are purely technical, so we omit them to keep the focus on our main argument.
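For Lemma 3 in particular, the claim is consistent with \(F\) being the binomial cumulative distribution function; under that assumption (which the excerpt above does not confirm), the omitted proof reduces to a classical derivative identity:

```latex
% Assumption: F(k, m, x) is the binomial CDF; this matches Lemma 3's
% statement but is not confirmed by the excerpt.
F(k,m,x) = \sum_{i=0}^{k} \binom{m}{i} x^{i} (1-x)^{m-i},
\qquad
\frac{\partial F(k,m,x)}{\partial x}
  = -\, m \binom{m-1}{k}\, x^{k} (1-x)^{m-1-k} < 0
\quad \text{for } 0 < x < 1 \text{ and } k < m .
```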
We are now ready to present the proof of Lemma 1.
Proof
(Sketch of Proof of Lemma 1) Recall that we have established

\(\Pr[\mathbf{E}_1(s,R)] = 1-F\big(L-1,\,|\mathcal{H}|,\,p(s,R)\big),\)

where \(p(s,R)=2\varPhi \left( \frac{wR}{2s}\right) -1\) (i.e., Eq. 3). Since p(s, R) is monotonically decreasing with respect to s, Lemma 3 implies that \(\Pr[\mathbf{E}_1(s,R)]\) is monotonically decreasing with respect to s, which proves Inequality 4.
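As a quick numerical sanity check of this step, the snippet below evaluates Eq. 3 directly; the bucket width w = 4 and radius R = 1 are illustrative assumptions, not values from the paper.

```python
from scipy.stats import norm  # Phi is the standard normal CDF

def p_collision(s, R=1.0, w=4.0):
    """Eq. 3: p(s, R) = 2 * Phi(w * R / (2 * s)) - 1."""
    return 2.0 * norm.cdf(w * R / (2.0 * s)) - 1.0

# p(s, R) decreases monotonically as the distance s grows.
for s in (0.5, 1.0, 2.0, 4.0):
    print(f"s = {s:>3}: p(s, R) = {p_collision(s):.4f}")
```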
To prove Inequality 5, note that, by the law of total probability,

\(\Pr[\mathbf{E}_2(s,R)] = \sum_{i=L}^{m} \Pr\left[\mathbf{E}_2(s,R)\mid {\text{Col}}(\varvec{o},R)=i\right] \, g_i(L,|\mathcal{H}|,p(s,R)).\)

Consider \(g_i(L,|\mathcal{H}|,p(s,R))\) as a distribution over the number of collisions \(i=L,L+1,\ldots ,m\). The monotonicity of \(G(n,n^\prime ,m,x)\) stated in Lemma 4 implies that \(g_i(L,|\mathcal{H}|,p(s,R))\) becomes more skewed toward small i as s increases. In addition, it can be shown that the probability \(\Pr \left[ \mathbf{E}_2(s,R)\mid{\text{Col}}(\varvec{o},R)=i\right]\), viewed as a function of i and s, is monotonically decreasing with respect to s and monotonically increasing with respect to i. Combining the above results, we obtain

\(\Pr[\mathbf{E}_2(s_1,R)]\ \ge\ \Pr[\mathbf{E}_2(s_2,R)],\)

which is exactly Inequality 5.
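To see the combined effect numerically, one may model the collision count as \({\text{Col}}(\varvec{o},R)\sim \mathrm{Binomial}(|\mathcal{H}|,\,p(s,R))\); this model is an assumption consistent with Lemmata 3 and 4, not a statement taken from the paper. Under it, the tail mass at or above the threshold L visibly shrinks as s grows:

```python
from scipy.stats import binom, norm

m, L, w, R = 32, 8, 4.0, 1.0  # illustrative values only

def p_collision(s):
    return 2.0 * norm.cdf(w * R / (2.0 * s)) - 1.0

# Assumed model: Col(o, R) ~ Binomial(m, p(s, R)); binom.sf(L - 1, ...)
# gives Pr[Col >= L], which decreases in s as the distribution skews
# toward small collision counts.
for s in (0.5, 1.0, 2.0, 4.0):
    print(f"s = {s:>3}: Pr[Col >= {L}] = {binom.sf(L - 1, m, p_collision(s)):.4f}")
```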
Finally, Inequality 6 is a direct corollary of Inequalities 4 and 5, since \(\mathbf{E}_0(s,R)=\mathbf{E}_1(s,R)\cap \mathbf{E}_2(s,R)\) by definition. \(\square\)