
Efficient locality-sensitive hashing over high-dimensional streaming data

  • S.I.: Deep Geospatial Data Understanding
  • Neural Computing and Applications

Abstract

Approximate nearest neighbor (ANN) search in high-dimensional spaces is fundamental to many applications. Locality-sensitive hashing (LSH) is a well-known methodology for solving the ANN problem. Existing LSH-based ANN solutions typically employ a large number of individual indexes optimized for search efficiency, and updating such indexes can be impractical when processing high-dimensional streaming data. In this paper, we present a novel disk-based LSH index that efficiently supports both searches and updates. The contributions of our work are threefold. First, we store the LSH projections in write-friendly LSM-trees to facilitate efficient updates. Second, we develop a novel scheme to estimate the number of required LSH functions, which effectively reduces disk storage and access costs. Third, we exploit both the collision number and the projection distance to refine candidate selection, improving search performance with theoretical guarantees on result quality. Experiments on four real-world datasets show that our proposal outperforms state-of-the-art schemes.
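To make the combination of ideas concrete, here is a minimal, self-contained Python sketch, not the paper's implementation: all class names, parameters, and the toy compaction policy are hypothetical. It pairs a p-stable LSH projection with an LSM-style store of sorted runs, so inserts append to a small buffer instead of rewriting a search-optimized index:

```python
# Minimal illustrative sketch (hypothetical names/parameters, not the paper's
# implementation): p-stable LSH keys stored in an LSM-style set of sorted
# runs, so updates never rewrite the whole index.
import bisect
import random


class PStableLSH:
    """E2LSH-style projection: h(o) = floor((a . o + b) / w)."""

    def __init__(self, dim, w=4.0, seed=0):
        rng = random.Random(seed)
        self.a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # 2-stable (Gaussian)
        self.b = rng.uniform(0.0, w)
        self.w = w

    def key(self, o):
        proj = sum(ai * oi for ai, oi in zip(self.a, o))
        return int((proj + self.b) // self.w)


class LSMRunStore:
    """Toy LSM-tree: full memtables are frozen into sorted runs and merged,
    so inserts stay cheap (write-friendly)."""

    def __init__(self, memtable_limit=64):
        self.memtable, self.runs, self.limit = [], [], memtable_limit

    def insert(self, key, obj_id):
        bisect.insort(self.memtable, (key, obj_id))
        if len(self.memtable) >= self.limit:       # flush the memtable
            self.runs.append(self.memtable)
            self.memtable = []
            if len(self.runs) > 2:                 # naive compaction: merge all runs
                self.runs = [sorted(e for run in self.runs for e in run)]

    def lookup(self, key):
        """Yield ids of objects whose LSH key collides with `key`."""
        for run in self.runs + [self.memtable]:
            i = bisect.bisect_left(run, (key, -1))
            while i < len(run) and run[i][0] == key:
                yield run[i][1]
                i += 1


if __name__ == "__main__":
    h = PStableLSH(dim=8)
    store = LSMRunStore(memtable_limit=16)
    data = {i: [random.random() for _ in range(8)] for i in range(200)}
    for i, o in data.items():
        store.insert(h.key(o), i)                  # streaming inserts
    q = data[0]
    print("colliding candidates:", sorted(store.lookup(h.key(q))))
```

In the paper's disk-based setting the runs would be on-disk LSM components; the sketch also omits the multiple hash functions, collision counting, and projection-distance filtering that the paper layers on top.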


Notes

  1. Note that Algorithm 1 sets K proportional to the total number of objects, \(|\mathcal {D}|\). The ratio, denoted \(\beta\), controls the false-positive rate during the search.
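     For illustration only (the numbers are hypothetical): with \(\beta =0.01\) and a stream of \(|\mathcal {D}|=10^6\) objects, this rule gives \(K=\beta |\mathcal {D}|=10^4\).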

  2. That is, SSPD values of different data objects are considered distinct elements in \(A_{s,j}\) even though they might be equal. Thus, the size of the multiset \(A_{s,j}\) is precisely the number of objects colliding at least j times with the query (within radius \(R=1\)).
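     For instance (an illustrative case), if two distinct objects both collide with the query at least \(j\) times and both happen to have SSPD value 0.7, then 0.7 appears twice in \(A_{s,j}\) and contributes two to \(|A_{s,j}|\).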

  3. http://corpus-texmex.irisa.fr/.

  4. http://www.ifs.tuwien.ac.at/mir/msd/.

  5. http://corpus-texmex.irisa.fr/.

  6. http://phototour.cs.washington.edu/patches/.


Acknowledgements

The authors would like to thank the editor and the anonymous reviewers for their valuable suggestions and comments. This work was funded in part by the Center of Excellence for NEOM Research at KAUST under award number REI/1/4178-01-01, and by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under award numbers BAS/1/1624-01, REI/1/0018-01-01, REI/1/4216-01-01, REI/1/4437-01-01, and REI/1/4473-01-01.

Author information

Correspondence to Chengcheng Yang.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix 1: Sketch of Proof of Lemma 1

In this section, we provide a brief proof of Lemma 1 from Sect. 4.3. Recall that, for arbitrary \(\varvec{o}_1,\varvec{o}_2,\varvec{q}\in \mathbb {R}^d\), Lemma 1 asserts the following:

$$\begin{aligned} \Pr \left[ \mathbf{E}_1(s_1,R)\right]&>\Pr \left[ \mathbf{E}_1(s_2,R)\right] ,&(4)\\ \Pr \left[ \mathbf{E}_2(s_1,R)|\mathbf{E}_1(s_1,R)\right]&>\Pr \left[ \mathbf{E}_2(s_2,R)|\mathbf{E}_1(s_2,R)\right] ,&(5)\\ \Pr \left[ \mathbf{E}_0(s_1,R)\right]&>\Pr \left[ \mathbf{E}_0(s_2,R)\right] ,&(6) \end{aligned}$$

where \(s_1={\text{ dist }}(\varvec{o}_1,\varvec{q})\), \(s_2={\text{ dist }}(\varvec{o}_2,\varvec{q})\), and \(s_1 < s_2\).

To prove Lemma  1, we need to consider the following functions:

$$\begin{aligned} F(k,m,x)&\overset{{\tiny {\text{ def }}}}=\sum _{i=0}^k\left( {\begin{array}{c}m\\ i\end{array}}\right) x^i(1-x)^{m-i},\\ g_i(n,m,x)&\overset{{\tiny {\text{ def }}}}=\frac{\left( {\begin{array}{c}m\\ i\end{array}}\right) x^i(1-x)^{m-i}}{1-F(n-1,m,x)},\\ G(n,n^\prime ,m,x)&\overset{{\tiny {\text{ def }}}}=\sum _{i=n}^{n^\prime }g_i(n,m,x)\\&=\frac{F(n^\prime ,m,x)-F(n-1,m,x)}{1-F(n-1,m,x)}. \end{aligned}$$

The following technical lemmata will be useful.

Lemma 3

\(F(k,m,x)\) is monotonically decreasing with respect to x if \(0\le x\le 1\) and \(k<m\).

Lemma 4

\(G(n,n^\prime ,m,x)\) is monotonically decreasing with respect to x if \(0\le x\le 1\) and \(n\le n^\prime <m\).

Lemmata 3 and 4 state monotonicity properties of the binomial distribution. Their proofs are purely mathematical and straightforward, so we omit them to avoid distraction from our main focus.
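As a quick numerical sanity check (not a proof; all parameter values below are arbitrary), the following snippet evaluates \(F\) and \(G\) from the definitions above at increasing \(x\) and confirms that both decrease:

```python
# Numerical illustration of Lemmata 3 and 4: F(k, m, x) and G(n, n', m, x)
# are evaluated at increasing x with k < m and n <= n' < m; both sequences
# should be strictly decreasing. Parameter values are arbitrary.
from math import comb


def F(k, m, x):
    """Binomial CDF: Pr[Bin(m, x) <= k]."""
    return sum(comb(m, i) * x**i * (1 - x) ** (m - i) for i in range(k + 1))


def G(n, n2, m, x):
    """Pr[n <= Bin(m, x) <= n2 | Bin(m, x) >= n]."""
    return (F(n2, m, x) - F(n - 1, m, x)) / (1 - F(n - 1, m, x))


xs = [0.1, 0.3, 0.5, 0.7, 0.9]
fs = [F(3, 10, x) for x in xs]      # Lemma 3: k = 3 < m = 10
gs = [G(2, 5, 10, x) for x in xs]   # Lemma 4: n = 2 <= n' = 5 < m = 10
assert fs == sorted(fs, reverse=True) and gs == sorted(gs, reverse=True)
print([round(v, 4) for v in fs])
print([round(v, 4) for v in gs])
```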

We are now ready to present the proof of Lemma  1.

Proof

(Sketch of Proof of Lemma 1) Recall that we have established

$$\begin{aligned} \Pr \left[ \mathbf{E }_1(s,R)\right]&=\sum \limits _{i=L}^{|\mathcal {H}|} \left( {\begin{array}{c}|\mathcal {H}|\\ i\end{array}}\right) p(s,R)^i(1-p(s,R))^{|\mathcal {H}|-i},\\&=1-F\left( L-1,|\mathcal {H}|,p(s,R)\right) , \end{aligned}$$

where \(p(s,R)=2\varPhi \left( \frac{wR}{2s}\right) -1\) (i.e., Eq. 3). Since \(p(s,R)\) is monotonically decreasing with respect to s, it follows from Lemma 3 that \(\Pr [\mathbf{E}_1(s,R)]\) is monotonically decreasing with respect to s, which proves Inequality 4.
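To visualise this monotonicity, the short sketch below evaluates \(p(s,R)\) and \(\Pr \left[ \mathbf{E}_1(s,R)\right] =1-F(L-1,|\mathcal {H}|,p(s,R))\) for growing \(s\), reusing \(F\) from the previous snippet; the values of \(w\), \(|\mathcal {H}|\), and \(L\) are assumed toy settings, not the paper's:

```python
# Evaluate p(s, R) = 2*Phi(w*R/(2s)) - 1 and Pr[E_1(s, R)] for growing s;
# both should decrease. w, m = |H|, and L are assumed toy values.
from math import comb, erf, sqrt


def Phi(t):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))


def F(k, m, x):
    """Binomial CDF: Pr[Bin(m, x) <= k]."""
    return sum(comb(m, i) * x**i * (1 - x) ** (m - i) for i in range(k + 1))


def pr_E1(s, R=1.0, w=4.0, m=30, L=10):
    p = 2.0 * Phi(w * R / (2.0 * s)) - 1.0
    return p, 1.0 - F(L - 1, m, p)


for s in [0.5, 1.0, 2.0, 4.0, 8.0]:
    p, e1 = pr_E1(s)
    print(f"s = {s:>4}: p(s, R) = {p:.3f}, Pr[E1] = {e1:.4f}")
```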

To prove Inequality  5, note that

$$\begin{aligned} \Pr&\left[ \mathbf{E }_2(s,R)|\mathbf{E }_1(s,R)\right] \\&=\sum _{i=L}^m \frac{\Pr [{\text{ Col }}(\varvec{o},R)=i]}{\Pr \left[ \mathbf{E }_1(s,R)\right] }\Pr \left[ \mathbf{E }_2(s,R)|{\text{ Col }}(\varvec{o},R)=i\right] \\&=\sum _{i=L}^m g_i(L,|\mathcal {H}|,p(s,R))\Pr \left[ \mathbf{E }_2(s,R)|{\text{ Col }}(\varvec{o},R)=i\right] . \end{aligned}$$

Consider \(g_i(L,|\mathcal {H}|,p(s,R))\) as a distribution over the number of collisions \(i=L,L+1,\ldots ,m\). The monotonicity of

$$\begin{aligned} G(L,L',|\mathcal {H}|,p(s,R))=\sum _{i=L}^{L^\prime }g_i(L,|\mathcal {H}|,p(s,R)), \end{aligned}$$

as stated in Lemma 4, implies that \(g_i(L,|\mathcal {H}|,p(s,R))\) is more skewed toward small i’s as s increases. In addition, it can be shown that the probability \(\Pr \left[ \mathbf{E }_2(s,R)|{\text{ Col }}(\varvec{o},R)=i\right]\), viewed as a function of i and s, is monotonically decreasing with respect to s and monotonically increasing with respect to i. Combining the above results, we obtain

$$\begin{aligned} \Pr&\left[ \mathbf{E }_2(s_1,R)|\mathbf{E }_1(s_1,R)\right] \\&=\sum _{i=L}^{m}g_i(L,|\mathcal {H}|,p(s_1,R))\Pr \left[ \mathbf{E }_2(s_1,R)|{\text{ Col }}(\varvec{o},R)=i\right] \\&>\sum _{i=L}^{m}g_i(L,|\mathcal {H}|,p(s_2,R))\Pr \left[ \mathbf{E }_2(s_1,R)|{\text{ Col }}(\varvec{o},R)=i\right] \\&>\sum _{i=L}^{m}g_i(L,|\mathcal {H}|,p(s_2,R))\Pr \left[ \mathbf{E }_2(s_2,R)|{\text{ Col }}(\varvec{o},R)=i\right] \\&=\Pr \left[ \mathbf{E }_2(s_2,R)|\mathbf{E }_1(s_2,R)\right] . \end{aligned}$$

Finally, Inequality  6 is a direct corollary of Inequalities  4 and  5 since \(\mathbf{E }_0(s,R)=\mathbf{E }_1(s,R)\cap \mathbf{E }_2(s,R)\) by definition. \(\square\)


Cite this article

Wang, H., Yang, C., Zhang, X. et al. Efficient locality-sensitive hashing over high-dimensional streaming data. Neural Comput & Applic 35, 3753–3766 (2023). https://doi.org/10.1007/s00521-020-05336-1

