EI-LSH: An early-termination driven I/O efficient incremental c-approximate nearest neighbor search

Liu, Wanqi; Wang, Hanchen; Zhang, Ying; Wang, Wei; Qin, Lu; Lin, Xuemin

doi:10.1007/s00778-020-00635-4

Regular Paper
Published: 30 September 2020

Volume 30, pages 215–235, (2021)

The VLDB Journal

Wanqi Liu^1,2,
Hanchen Wang^1,2 (ORCID: orcid.org/0000-0003-3158-9586),
Ying Zhang^2,
Wei Wang^3,
Lu Qin^2 &
Xuemin Lin^3

Abstract

Nearest neighbor (NN) search in high-dimensional space is widely used in fields such as databases, data mining and machine learning. The problem is well solved in low-dimensional space, but in high-dimensional space it remains challenging because of the curse of dimensionality. As a trade-off between accuracy and efficiency, c-approximate nearest neighbor (c-ANN) search is considered instead of exact NN search in high-dimensional space. A variety of c-ANN algorithms have been proposed. One important scheme for the c-ANN problem is locality-sensitive hashing (LSH), which projects a high-dimensional dataset into a low-dimensional one and can return a c-ANN with constant probability. In this paper, we propose a new aggressive early-termination (ET) condition that stops an LSH-based algorithm earlier under the same theoretical guarantee, leading to a smaller I/O cost and less running time. Unlike the “conservative” early-termination conditions used in previous studies, our “aggressive” condition can stop much earlier. Although it is not absolutely safe and introduces a probability of failure, we can still devise more efficient algorithms under the same theoretical guarantee by carefully accounting for the failure probabilities introduced by both the LSH scheme and early termination. Furthermore, we introduce an incremental search strategy. Unlike previous LSH methods, which expand the bucket width exponentially, we employ a more natural search strategy that incrementally accesses the hash values of the objects. We also provide a rigorous theoretical analysis to underpin our incremental search strategy and the new early-termination technique. Our comprehensive experimental results show that, compared with state-of-the-art I/O-efficient c-ANN techniques, our proposed algorithm, EI-LSH, achieves much better I/O efficiency under the same theoretical guarantee.
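
The two ideas named in the abstract, incremental access to hash values and early termination, can be illustrated with a minimal Python sketch. It is not the EI-LSH algorithm: the paper's actual termination condition, index layout and I/O model are not given on this page. The function names, the parameter tau and the stopping rule below are all assumptions made for illustration. The sketch projects points with random Gaussian vectors, visits objects in order of increasing projected distance to the query, and stops once the next projected distance exceeds a threshold tied to the best true distance found so far.

```python
import heapq
import math
import random


def make_projections(dim, m, seed=0):
    """Draw m random Gaussian projection vectors of length dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]


def project(point, projections):
    """Map a high-dimensional point to its m hash values (dot products)."""
    return [sum(a * x for a, x in zip(vec, point)) for vec in projections]


def incremental_search(data, query, projections, c=2.0, tau=1.0):
    """Visit objects by increasing projected distance and stop early.

    tau is a hypothetical knob, not a parameter from the paper. The scan
    stops once the projected distance of the next object exceeds
    tau * sqrt(m) * best_dist / c.
    Returns (index of best object, its true distance, objects examined).
    """
    m = len(projections)
    q_proj = project(query, projections)
    # A heap lets us consume objects one at a time, nearest projection first.
    order = [(math.dist(project(p, projections), q_proj), i)
             for i, p in enumerate(data)]
    heapq.heapify(order)

    best_idx, best_dist, examined = -1, math.inf, 0
    while order:
        proj_dist, i = heapq.heappop(order)
        # Hypothetical early-termination test (not the paper's condition).
        if proj_dist > tau * math.sqrt(m) * best_dist / c:
            break
        examined += 1
        true_dist = math.dist(data[i], query)
        if true_dist < best_dist:
            best_idx, best_dist = i, true_dist
    return best_idx, best_dist, examined
```

A smaller tau makes the stop more aggressive, so fewer objects are examined but a poor answer becomes more likely. Setting tau to math.inf disables the stop, so the scan degenerates to an exact linear search. Balancing that failure probability against the failure probability of the hashing itself is the trade-off the abstract describes.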

Notes

Note that each entry takes 8 bytes for one hash value and the object ID.
Here, it is not necessary that the instance belongs to A.
https://github.com/DBWangGroupUNSW/nns_benchmark/tree/master/data.

Author information

Authors and Affiliations

Zhejiang Gongshang University, Hangzhou, China
Wanqi Liu & Hanchen Wang
AAII, University of Technology, Sydney, Australia
Wanqi Liu, Hanchen Wang, Ying Zhang & Lu Qin
The University of New South Wales, Sydney, Australia
Wei Wang & Xuemin Lin

Corresponding author

Correspondence to Hanchen Wang.

About this article

Cite this article

Liu, W., Wang, H., Zhang, Y. et al. EI-LSH: An early-termination driven I/O efficient incremental c-approximate nearest neighbor search. The VLDB Journal 30, 215–235 (2021). https://doi.org/10.1007/s00778-020-00635-4

Received: 21 November 2019
Revised: 06 August 2020
Accepted: 01 September 2020
Published: 30 September 2020
Issue Date: March 2021
DOI: https://doi.org/10.1007/s00778-020-00635-4
