
EI-LSH: An early-termination driven I/O efficient incremental c-approximate nearest neighbor search

  • Regular Paper
The VLDB Journal

Abstract

Nearest neighbor (NN) search in high-dimensional space is widely used in fields such as databases, data mining and machine learning. The problem is well solved in low-dimensional space, but it becomes challenging in high-dimensional space because of the curse of dimensionality. As a trade-off between accuracy and efficiency, c-approximate nearest neighbor (c-ANN) search is therefore considered instead of exact NN search in high-dimensional space. A variety of c-ANN algorithms have been proposed. One important scheme for the c-ANN problem is locality-sensitive hashing (LSH), which projects a high-dimensional dataset into a low-dimensional one and can return a c-ANN with constant probability. In this paper, we propose a new aggressive early-termination (ET) condition that stops an LSH-based algorithm earlier under the same theoretical guarantee, leading to a lower I/O cost and a shorter running time. Unlike the “conservative” early-termination conditions used in previous studies, our “aggressive” condition can stop much earlier. Although it is not absolutely safe and introduces an additional probability of failure, we can still devise more efficient algorithms under the same theoretical guarantee by carefully accounting for the failure probabilities introduced by both the LSH scheme and early termination. Furthermore, we introduce an incremental search strategy. Unlike previous LSH methods, which expand the bucket width exponentially, we employ a more natural strategy that incrementally accesses the hash values of the objects. We provide a rigorous theoretical analysis to underpin both the incremental search strategy and the new early-termination technique. Our comprehensive experimental results show that, compared with state-of-the-art I/O-efficient c-ANN techniques, the proposed algorithm, EI-LSH, achieves much better I/O efficiency under the same theoretical guarantee.
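To make the two ideas in the abstract more concrete, the following is a minimal Python sketch of incremental search over random projections with an early stop. It is not the EI-LSH algorithm: the function names `build_index` and `incremental_search`, the Gaussian projections, and in particular the stop rule are assumptions chosen for illustration, and the stop rule carries none of the probabilistic guarantees the paper derives. The sketch only shows the general shape: objects are visited in increasing order of projected distance rather than by exponentially widening buckets, and the scan stops once the best candidate found so far is within a factor c of a rough distance estimate for the current frontier.

```python
import numpy as np


def build_index(data, m, seed=0):
    """Project every object onto m random Gaussian (2-stable) directions."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((data.shape[1], m))
    return a, data @ a


def incremental_search(data, a, proj, query, c=2.0):
    """Visit objects in increasing projected distance, with an early stop.

    Returns (best_id, best_dist, visited). The stop rule below is a
    hypothetical heuristic for illustration, not the paper's ET condition.
    """
    q_proj = query @ a
    proj_dist = np.linalg.norm(proj - q_proj, axis=1)
    order = np.argsort(proj_dist)  # incremental access order
    # For Gaussian projections, E[proj_dist^2] = m * dist^2, so
    # proj_dist / sqrt(m) is a rough estimate of the original distance.
    scale = np.sqrt(a.shape[1])
    best_id, best_dist, visited = -1, np.inf, 0
    for idx in order:
        visited += 1
        d = float(np.linalg.norm(data[idx] - query))
        if d < best_dist:
            best_id, best_dist = int(idx), d
        # Illustrative stop: the best candidate is already within c times
        # the estimated distance of the object currently being visited.
        if best_dist <= c * proj_dist[idx] / scale:
            break
    return best_id, best_dist, visited
```

In a disk-based setting, `visited` would stand in for the number of objects fetched, which is the cost an earlier stop is meant to reduce. A larger `c` in this sketch stops sooner at the price of a looser answer.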

Notes

  1. Note that each entry takes 8 bytes for one hash value and the object ID.

  2. Here, it is not necessary that the instance belongs to A.

  3. https://github.com/DBWangGroupUNSW/nns_benchmark/tree/master/data.

Author information

Corresponding author

Correspondence to Hanchen Wang.

About this article

Cite this article

Liu, W., Wang, H., Zhang, Y. et al. EI-LSH: An early-termination driven I/O efficient incremental c-approximate nearest neighbor search. The VLDB Journal 30, 215–235 (2021). https://doi.org/10.1007/s00778-020-00635-4

  • DOI: https://doi.org/10.1007/s00778-020-00635-4
