
Efficient locality-sensitive hashing over high-dimensional streaming data

  • S.I.: Deep Geospatial Data Understanding
  • Neural Computing and Applications

Abstract

Approximate nearest neighbor (ANN) search in high-dimensional spaces is fundamental to many applications. Locality-sensitive hashing (LSH) is a well-known methodology for solving the ANN problem. Existing LSH-based ANN solutions typically employ a large number of individual indexes optimized for search efficiency, and updating such indexes can be impractical when processing high-dimensional streaming data. In this paper, we present a novel disk-based LSH index that efficiently supports both searches and updates. The contributions of our work are threefold. First, we store the LSH projections in write-friendly LSM-trees to facilitate efficient updates. Second, we develop a novel scheme to estimate the number of required LSH functions, which effectively reduces disk storage and access costs. Third, we exploit both the collision number and the projection distance to refine candidate selection, improving search performance with theoretical guarantees on result quality. Experiments on four real-world datasets show that our proposal outperforms state-of-the-art schemes.
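To make the combination of ideas concrete, here is a minimal, self-contained Python sketch, not the paper's implementation: all class names, parameters, and the toy compaction policy are hypothetical. It pairs a p-stable LSH projection with an LSM-style store of sorted runs, so inserts append to a small buffer instead of rewriting a search-optimized index:

```python
# Minimal illustrative sketch (hypothetical names/parameters, not the paper's
# implementation): p-stable LSH keys stored in an LSM-style set of sorted
# runs, so updates never rewrite the whole index.
import bisect
import random


class PStableLSH:
    """E2LSH-style projection: h(o) = floor((a . o + b) / w)."""

    def __init__(self, dim, w=4.0, seed=0):
        rng = random.Random(seed)
        self.a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # 2-stable (Gaussian)
        self.b = rng.uniform(0.0, w)
        self.w = w

    def key(self, o):
        proj = sum(ai * oi for ai, oi in zip(self.a, o))
        return int((proj + self.b) // self.w)


class LSMRunStore:
    """Toy LSM-tree: full memtables are frozen into sorted runs and merged,
    so inserts stay cheap (write-friendly)."""

    def __init__(self, memtable_limit=64):
        self.memtable, self.runs, self.limit = [], [], memtable_limit

    def insert(self, key, obj_id):
        bisect.insort(self.memtable, (key, obj_id))
        if len(self.memtable) >= self.limit:       # flush the memtable
            self.runs.append(self.memtable)
            self.memtable = []
            if len(self.runs) > 2:                 # naive compaction: merge all runs
                self.runs = [sorted(e for run in self.runs for e in run)]

    def lookup(self, key):
        """Yield ids of objects whose LSH key collides with `key`."""
        for run in self.runs + [self.memtable]:
            i = bisect.bisect_left(run, (key, -1))
            while i < len(run) and run[i][0] == key:
                yield run[i][1]
                i += 1


if __name__ == "__main__":
    h = PStableLSH(dim=8)
    store = LSMRunStore(memtable_limit=16)
    data = {i: [random.random() for _ in range(8)] for i in range(200)}
    for i, o in data.items():
        store.insert(h.key(o), i)                  # streaming inserts
    q = data[0]
    print("colliding candidates:", sorted(store.lookup(h.key(q))))
```

In the paper's disk-based setting the runs would be on-disk LSM components; the sketch also omits the multiple hash functions, collision counting, and projection-distance filtering that the paper layers on top.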


Notes

  1. Note that Algorithm 1 sets K proportional to the total number of objects, \(|\mathcal {D}|\). The ratio, denoted \(\beta\), controls the false-positive rate during the search.
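     For illustration only (the numbers are hypothetical): with \(\beta =0.01\) and a stream of \(|\mathcal {D}|=10^6\) objects, this rule gives \(K=\beta |\mathcal {D}|=10^4\).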

  2. That is, SSPD values of different data objects are considered distinct elements in \(A_{s,j}\) even though they might be equal. Thus, the size of the multiset \(A_{s,j}\) is precisely the number of objects colliding at least j times with the query (within radius \(R=1\)).
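     For instance (an illustrative case), if two distinct objects both collide with the query at least \(j\) times and both happen to have SSPD value 0.7, then 0.7 appears twice in \(A_{s,j}\) and contributes two to \(|A_{s,j}|\).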

  3. http://corpus-texmex.irisa.fr/.

  4. http://www.ifs.tuwien.ac.at/mir/msd/.

  5. http://corpus-texmex.irisa.fr/.

  6. http://phototour.cs.washington.edu/patches/.


Acknowledgements

The authors would like to thank the editor and the anonymous reviewers for their valuable suggestions and comments. This work was funded in part by the Center of Excellence for NEOM Research at KAUST under award number REI/1/4178-01-01, and by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under award numbers BAS/1/1624-01, REI/1/0018-01-01, REI/1/4216-01-01, REI/1/4437-01-01, and REI/1/4473-01-01.

Author information

Correspondence to Chengcheng Yang.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix 1: Sketch of Proof of Lemma 1

In this section, we provide a brief proof of Lemma 1 from Sect. 4.3. Recall that, for arbitrary \(\varvec{o}_1,\varvec{o}_2,\varvec{q}\in \mathbb {R}^d\), Lemma 1 asserts the following:

$$\begin{aligned} \Pr \left[ \mathbf{E}_1(s_1,R)\right]&>\Pr \left[ \mathbf{E}_1(s_2,R)\right] ,&(4)\\ \Pr \left[ \mathbf{E}_2(s_1,R)|\mathbf{E}_1(s_1,R)\right]&>\Pr \left[ \mathbf{E}_2(s_2,R)|\mathbf{E}_1(s_2,R)\right] ,&(5)\\ \Pr \left[ \mathbf{E}_0(s_1,R)\right]&>\Pr \left[ \mathbf{E}_0(s_2,R)\right] ,&(6) \end{aligned}$$

where \(s_1={\text{ dist }}(\varvec{o}_1,\varvec{q})\), \(s_2={\text{ dist }}(\varvec{o}_2,\varvec{q})\), and \(s_1 < s_2\).

To prove Lemma  1, we need to consider the following functions:

$$\begin{aligned} F(k,m,x)&\overset{{\tiny {\text{ def }}}}=\sum _{i=0}^k\left( {\begin{array}{c}m\\ i\end{array}}\right) x^i(1-x)^{m-i},\\ g_i(n,m,x)&\overset{{\tiny {\text{ def }}}}=\frac{\left( {\begin{array}{c}m\\ i\end{array}}\right) x^i(1-x)^{m-i}}{1-F(n-1,m,x)},\\ G(n,n^\prime ,m,x)&\overset{{\tiny {\text{ def }}}}=\sum _{i=n}^{n^\prime }g_i(n,m,x)\\&=\frac{F(n^\prime ,m,x)-F(n-1,m,x)}{1-F(n-1,m,x)}. \end{aligned}$$

The following technical lemmata will be useful.

Lemma 3

\(F(k,m,x)\) is monotonically decreasing with respect to x if \(0\le x\le 1\) and \(k<m\).

Lemma 4

\(G(n,n^\prime ,m,x)\) is monotonically decreasing with respect to x if \(0\le x\le 1\) and \(n\le n^\prime <m\).

Lemmata 3 and 4 state monotonicity properties of the binomial distribution. Their proofs are purely mathematical and straightforward, so we omit them to avoid distraction from our main focus.
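As a quick numerical sanity check (not a proof; all parameter values below are arbitrary), the following snippet evaluates \(F\) and \(G\) from the definitions above at increasing \(x\) and confirms that both decrease:

```python
# Numerical illustration of Lemmata 3 and 4: F(k, m, x) and G(n, n', m, x)
# are evaluated at increasing x with k < m and n <= n' < m; both sequences
# should be strictly decreasing. Parameter values are arbitrary.
from math import comb


def F(k, m, x):
    """Binomial CDF: Pr[Bin(m, x) <= k]."""
    return sum(comb(m, i) * x**i * (1 - x) ** (m - i) for i in range(k + 1))


def G(n, n2, m, x):
    """Pr[n <= Bin(m, x) <= n2 | Bin(m, x) >= n]."""
    return (F(n2, m, x) - F(n - 1, m, x)) / (1 - F(n - 1, m, x))


xs = [0.1, 0.3, 0.5, 0.7, 0.9]
fs = [F(3, 10, x) for x in xs]      # Lemma 3: k = 3 < m = 10
gs = [G(2, 5, 10, x) for x in xs]   # Lemma 4: n = 2 <= n' = 5 < m = 10
assert fs == sorted(fs, reverse=True) and gs == sorted(gs, reverse=True)
print([round(v, 4) for v in fs])
print([round(v, 4) for v in gs])
```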

We are now ready to present the proof of Lemma  1.

Proof

(Sketch of Proof of Lemma 1) Recall that we have established

$$\begin{aligned} \Pr \left[ \mathbf{E }_1(s,R)\right]&=\sum \limits _{i=L}^{|\mathcal {H}|} \left( {\begin{array}{c}|\mathcal {H}|\\ i\end{array}}\right) p(s,R)^i(1-p(s,R))^{|\mathcal {H}|-i},\\&=1-F\left( L-1,|\mathcal {H}|,p(s,R)\right) , \end{aligned}$$

where \(p(s,R)=2\varPhi \left( \frac{wR}{2s}\right) -1\) (i.e., Eq. 3). Since \(p(s,R)\) is monotonically decreasing with respect to s, it follows from Lemma 3 that \(\Pr [\mathbf{E}_1(s,R)]\) is monotonically decreasing with respect to s, which proves Inequality 4.
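To visualise this monotonicity, the short sketch below evaluates \(p(s,R)\) and \(\Pr \left[ \mathbf{E}_1(s,R)\right] =1-F(L-1,|\mathcal {H}|,p(s,R))\) for growing \(s\), reusing \(F\) from the previous snippet; the values of \(w\), \(|\mathcal {H}|\), and \(L\) are assumed toy settings, not the paper's:

```python
# Evaluate p(s, R) = 2*Phi(w*R/(2s)) - 1 and Pr[E_1(s, R)] for growing s;
# both should decrease. w, m = |H|, and L are assumed toy values.
from math import comb, erf, sqrt


def Phi(t):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))


def F(k, m, x):
    """Binomial CDF: Pr[Bin(m, x) <= k]."""
    return sum(comb(m, i) * x**i * (1 - x) ** (m - i) for i in range(k + 1))


def pr_E1(s, R=1.0, w=4.0, m=30, L=10):
    p = 2.0 * Phi(w * R / (2.0 * s)) - 1.0
    return p, 1.0 - F(L - 1, m, p)


for s in [0.5, 1.0, 2.0, 4.0, 8.0]:
    p, e1 = pr_E1(s)
    print(f"s = {s:>4}: p(s, R) = {p:.3f}, Pr[E1] = {e1:.4f}")
```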

To prove Inequality  5, note that

$$\begin{aligned} \Pr&\left[ \mathbf{E }_2(s,R)|\mathbf{E }_1(s,R)\right] \\&=\sum _{i=L}^m \frac{\Pr [{\text{ Col }}(\varvec{o},R)=i]}{\Pr \left[ \mathbf{E }_1(s,R)\right] }\Pr \left[ \mathbf{E }_2(s,R)|{\text{ Col }}(\varvec{o},R)=i\right] \\&=\sum _{i=L}^m g_i(L,|\mathcal {H}|,p(s,R))\Pr \left[ \mathbf{E }_2(s,R)|{\text{ Col }}(\varvec{o},R)=i\right] . \end{aligned}$$

Consider \(g_i(L,|\mathcal {H}|,p(s,R))\) as a distribution over the number of collisions \(i=L,L+1,\ldots ,m\). The monotonicity of

$$\begin{aligned} G(L,L',|\mathcal {H}|,p(s,R))=\sum _{i=L}^{L^\prime }g_i(L,|\mathcal {H}|,p(s,R)), \end{aligned}$$

as stated in Lemma 4, implies that \(g_i(L,|\mathcal {H}|,p(s,R))\) is more skewed toward small i’s as s increases. In addition, it can be shown that the probability \(\Pr \left[ \mathbf{E }_2(s,R)|{\text{ Col }}(\varvec{o},R)=i\right]\), viewed as a function of i and s, is monotonically decreasing with respect to s and monotonically increasing with respect to i. Combining the above results, we obtain

$$\begin{aligned} \Pr&\left[ \mathbf{E }_2(s_1,R)|\mathbf{E }_1(s_1,R)\right] \\&=\sum _{i=L}^{m}g_i(L,|\mathcal {H}|,p(s_1,R))\Pr \left[ \mathbf{E }_2(s_1,R)|{\text{ Col }}(\varvec{o},R)=i\right] \\&>\sum _{i=L}^{m}g_i(L,|\mathcal {H}|,p(s_2,R))\Pr \left[ \mathbf{E }_2(s_1,R)|{\text{ Col }}(\varvec{o},R)=i\right] \\&>\sum _{i=L}^{m}g_i(L,|\mathcal {H}|,p(s_2,R))\Pr \left[ \mathbf{E }_2(s_2,R)|{\text{ Col }}(\varvec{o},R)=i\right] \\&=\Pr \left[ \mathbf{E }_2(s_2,R)|\mathbf{E }_1(s_2,R)\right] . \end{aligned}$$

Finally, Inequality  6 is a direct corollary of Inequalities  4 and  5 since \(\mathbf{E }_0(s,R)=\mathbf{E }_1(s,R)\cap \mathbf{E }_2(s,R)\) by definition. \(\square\)


Cite this article

Wang, H., Yang, C., Zhang, X. et al. Efficient locality-sensitive hashing over high-dimensional streaming data. Neural Comput & Applic 35, 3753–3766 (2023). https://doi.org/10.1007/s00521-020-05336-1

