Abstract
Nearest neighbor (NN) search in high-dimensional space is widely used in fields such as databases, data mining and machine learning. The problem is well solved in low-dimensional space, but in high-dimensional space it remains challenging because of the curse of dimensionality. As a trade-off between accuracy and efficiency, c-approximate nearest neighbor (c-ANN) search is considered instead of exact NN search in high-dimensional space. Among the variety of c-ANN algorithms that have been proposed, one important scheme is locality-sensitive hashing (LSH), which projects a high-dimensional dataset into a low-dimensional one and returns a c-ANN with constant probability. In this paper, we propose a new aggressive early-termination (ET) condition that stops an LSH-based algorithm earlier under the same theoretical guarantee, leading to a smaller I/O cost and less running time. Unlike the “conservative” early-termination conditions used in previous studies, our “aggressive” condition can stop much earlier. Although it is not absolutely safe and introduces a probability of failure, we can still devise more efficient algorithms under the same theoretical guarantee by carefully accounting for the failure probabilities introduced by both the LSH scheme and early termination. Furthermore, we introduce an incremental search strategy. Unlike previous LSH methods, which expand the bucket width exponentially, we employ a more natural strategy that incrementally accesses the hash values of the objects. We provide a rigorous theoretical analysis to underpin both the incremental search strategy and the new early-termination technique. Our comprehensive experimental results show that, compared with state-of-the-art I/O-efficient c-ANN techniques, our proposed algorithm, EI-LSH, achieves much better I/O efficiency under the same theoretical guarantee.
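To make the ideas in the abstract concrete, the following is a minimal Python sketch of projection-based search with an incremental visiting order and an early stop. It is an illustration under stated assumptions, not the EI-LSH algorithm: the function names, the `threshold` parameter and the stop rule are all hypothetical, and the abstract does not give the actual termination condition or index layout. The sketch projects points onto random Gaussian directions, visits objects in increasing order of projected distance to the query, and stops once the next projected distance exceeds a tunable multiple of the best true distance found so far.

```python
import math
import random


def make_projections(dim, m, seed=0):
    """Draw m random directions with i.i.d. N(0, 1) entries (assumed hash family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]


def project(point, directions):
    """Map a point to m hash values, one dot product per direction."""
    return [sum(a * x for a, x in zip(d, point)) for d in directions]


def incremental_search(data, query, directions, c=2.0, threshold=math.inf):
    """Visit objects by increasing projected distance.

    Returns (index, true_distance, number_of_objects_visited).
    `threshold` is a hypothetical knob: math.inf disables early stopping.
    """
    q_hash = project(query, directions)
    # Order all objects by their distance to the query in the projected space.
    keyed = sorted(
        (math.dist(project(p, directions), q_hash), i) for i, p in enumerate(data)
    )
    best_i, best_d, visited = None, math.inf, 0
    for proj_d, i in keyed:
        # Placeholder early-termination rule (an assumption, NOT the paper's
        # condition): stop when the next projected distance is large relative
        # to the best true distance found so far, scaled by 1/c.
        if best_i is not None and proj_d > threshold * best_d / c:
            break
        visited += 1
        d = math.dist(data[i], query)  # true distance in the original space
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d, visited


if __name__ == "__main__":
    rng = random.Random(1)
    data = [[rng.random() for _ in range(16)] for _ in range(200)]
    query = [rng.random() for _ in range(16)]
    dirs = make_projections(16, 4, seed=2)
    print(incremental_search(data, query, dirs, c=2.0, threshold=1.0))
```

With `threshold=math.inf` the loop degenerates to an exhaustive scan and returns the exact nearest neighbor. A finite `threshold` trades some chance of missing a good answer for fewer objects visited, which mirrors, in spirit only, the trade-off between conservative and aggressive termination described above.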
Notes
Note that each entry takes 8 bytes for one hash value and the object ID.
Here, it is not necessary that the instance belongs to A.
About this article
Cite this article
Liu, W., Wang, H., Zhang, Y. et al. EI-LSH: An early-termination driven I/O efficient incremental c-approximate nearest neighbor search. The VLDB Journal 30, 215–235 (2021). https://doi.org/10.1007/s00778-020-00635-4
DOI: https://doi.org/10.1007/s00778-020-00635-4