DyFT: a dynamic similarity search method on integer sketches

  • Regular paper
  • Knowledge and Information Systems

Abstract

Similarity-preserving hashing is a core technique for fast similarity search: it randomly maps data points in a metric space to strings of discrete symbols (i.e., sketches) in the Hamming space. While traditional hashing techniques produce binary sketches, recent ones produce integer sketches that preserve various similarity measures. However, most similarity search methods are designed for binary sketches and are inefficient for integer sketches. Moreover, most methods are either inapplicable to or inefficient for dynamic datasets, although modern real-world datasets are updated over time. We propose the dynamic filter trie (DyFT), a dynamic similarity search method for both binary and integer sketches. An extensive experimental analysis using large real-world datasets shows that DyFT is superior in scalability, time performance, and memory efficiency. For example, on a huge dataset of 216 million data points, DyFT performs a similarity search 6000 times faster than a state-of-the-art method while using one-thirteenth of the memory.

References

  1. Arslan AN, Eğecioğlu Ö (2002) Dictionary look-up within small edit distance. In: Proceedings of the 8th international computing and combinatorics conference (COCOON). pp 127–136

  2. Askitis N, Sinha R (2010) Engineering scalable, cache and space efficient tries for strings. VLDB J 19(5):633–660

  3. Batko M, Falchi F, Lucchese C, Novak D, Perego R, Rabitti F, Sedmidubsky J, Zezula P (2010) Building a web-scale image similarity search system. Multimedia Tools Appl 47(3):599–629

  4. Belazzougui D, Venturini R (2012) Compressed string dictionary look-up with edit distance one. In: Proceedings of the 23rd annual symposium on combinatorial pattern matching (CPM). pp 280–292

  5. Binna R, Zangerle E, Pichl M, Specht G, Leis V (2018) HOT: a height optimized trie index for main-memory database systems. In: Proceedings of the 2018 ACM SIGMOD international conference on management of data. pp 521–534

  6. Boehm M, Schlegel B, Volk PB, Fischer U, Habich D, Lehner W (2011) Efficient in-memory indexing with generalized prefix trees. In: Proceedings of the 14th BTW conference on database systems for business, technology, and web. pp 227–246

  7. Boytsov L (2011) Indexing methods for approximate dictionary searching: comparative analysis. J Exp Algorithm (JEA) 16:1

  8. Cao Y, Qi H, Zhou W, Kato J, Li K, Liu X, Gui J (2018) Binary hashing for approximate nearest neighbor search on big data: a survey. IEEE Access 6:2039–2054

  9. Chan H-L, Lam T-W, Sung W-K, Tam S-L, Wong S-S (2010) Compressed indexes for approximate string matching. Algorithmica 58(2):263–281

  10. Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the 34th annual ACM symposium on theory of computing (STOC). pp 380–388

  11. Chuang J-C, Cho C-W, Chen ALP (2006) Similarity search in transaction databases with a two-level bounding mechanism. In: Proceedings of the international conference on database systems for advanced applications (DASFAA). pp 572–586

  12. Cole R, Gottlieb L-A, Lewenstein M (2004) Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the 36th annual ACM symposium on theory of computing (STOC). pp 91–100

  13. Driemel A, Silvestri F (2017) Locality-sensitive hashing of curves. In: Proceedings of the 33rd international symposium on computational geometry (SoCG)

  14. Eghbali S, Ashtiani H, Tahvildari L (2020) Online nearest neighbor search using Hamming weight trees. IEEE Trans Pattern Anal Mach Intell 42(7):1729–1740

  15. Fredkin E (1960) Trie memory. Commun ACM 3(9):490–499

  16. Gog S, Venturini R (2016) Fast and compact Hamming distance index. In: Proceedings of the 39th international ACM SIGIR conference on research and development in information retrieval. pp 285–294

  17. Greene D, Parnas M, Yao F (1994) Multi-index hashing for information retrieval. In: Proceedings of the 35th annual symposium on foundations of computer science (FOCS). pp 722–731

  18. Heinz S, Zobel J, Williams HE (2002) Burst tries: a fast, efficient data structure for string keys. ACM Trans Inf Syst 20(2):192–223

  19. Henzinger M (2006) Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval. pp 284–291

  20. Ito J-I, Tabei Y, Shimizu K, Tsuda K, Tomii K (2012) PoSSuM: a database of similar protein-ligand binding and putative pockets. Nucleic Acids Res 40:D541–D548

  21. Kanda S, Tabei Y (2019) b-bit sketch trie: scalable similarity search on integer sketches. In: Proceedings of the 2019 IEEE international conference on big data. pp 810–819

  22. Kanda S, Tabei Y (2020) Dynamic similarity search on integer sketches. In: Proceedings of the 20th IEEE international conference on data mining (ICDM). pp 242–251

  23. Kanda S, Takeuchi K, Fujii K, Tabei Y (2020) Succinct trit-array trie for scalable trajectory similarity search. In: Proceedings of the 28th ACM SIGSPATIAL international conference on advances in geographic information systems (SIGSPATIAL). pp 518–529

  24. Kuhn M, Szklarczyk D, Franceschini A, Campillos M, von Mering C, Jensen LJ, Beyer A, Bork P (2009) STITCH 2: an interaction network database for small molecules and proteins. Nucleic Acids Res 38(suppl-1):D552–D556

  25. Leis V, Kemper A, Neumann T (2013) The adaptive radix tree: ARTful indexing for main-memory databases. In: Proceedings of the IEEE 29th international conference on data engineering (ICDE). pp 38–49

  26. Li P (2015) 0-bit consistent weighted sampling. In: Proceedings of the 21th ACM SIGKDD International conference on knowledge discovery and data mining. pp 665–674

  27. Li P (2017) Linearized GMM kernels and normalized random Fourier features. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. pp 315–324

  28. Li P, König C (2010) b-Bit minwise hashing. In: Proceedings of the 19th international conference on World Wide Web (WWW). pp 671–680

  29. Li P, Lu H, Zheng Q, Yang L, Pan G (2020) LISA: a learned index structure for spatial data. In: Proceedings of the 2020 ACM SIGMOD international conference on management of data. pp 2119–2133

  30. Loosli G, Canu S, Bottou L (2007) Training invariant support vector machines using selective sampling. In: Large scale kernel machines. pp 301–320

  31. McAuley J, Leskovec J (2013) Hidden factors and hidden topics: understanding rating dimensions with review text. In: Proceedings of the 7th ACM conference on recommender systems (RecSys). pp 165–172

  32. Mikolov T, Grave E, Bojanowski P, Puhrsch C, Joulin A (2018) Advances in pre-training distributed word representations. In: Proceedings of the international conference on language resources and evaluation (LREC)

  33. Norouzi M, Punjani A, Fleet DJ (2014) Fast exact search in Hamming space with multi-index hashing. IEEE Trans Pattern Anal Mach Intell 36(6):1107–1119

  34. Qin J, Xiao C, Wang Y, Wang W (2021) Generalizing the pigeonhole principle for similarity search in Hamming space. IEEE Trans Knowl Data Eng 33(2):489–505

  35. Song J, Yang Y, Yang Y, Huang Z, Shen HT (2013) Inter-media hashing for large-scale retrieval from heterogeneous data sources. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data. pp 785–796

  36. Sundaram N, Turmukhametova A, Satish N, Mostak T, Indyk P, Madden S, Dubey P (2013) Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proc VLDB Endow 6(14):1930–1941

  37. Tabei Y, Tsuda K (2011) Sketchsort: fast all pairs similarity search for large databases of molecular fingerprints. Mol Inform 30(9):801–807

  38. Weng Z, Zhu Y (2019) Online supervised sketching hashing for large-scale image retrieval. IEEE Access 7:88369–88379

  39. Yang R, Niu B (2020) Continuous K nearest neighbor queries over large-scale spatial-textual data streams. ISPRS Int J Geo-Inf 9(11):694

  40. Yoshinaga N, Kitsuregawa M (2014) A self-adaptive classifier for efficient text-stream processing. In: Proceedings of the 24th international conference on computational linguistics (COLING). pp 1091–1102

  41. Zhang H, Lim H, Leis V, Andersen DG, Kaminsky M, Keeton K, Pavlo A (2018) SuRF: practical range query filtering with fast succinct tries. In: Proceedings of the 2018 ACM SIGMOD international conference on management of data. pp 323–336

  42. Zhang X, Qin J, Wang W, Sun Y, Lu J (2013) HmSearch: an efficient Hamming distance query processing algorithm. In: Proceedings of the 25th international conference on scientific and statistical database management (SSDBM). p 19

Acknowledgements

This work was supported by JST AIP-PRISM (Grant Number JPMJCR18Y5). We thank the anonymous reviewers for their helpful comments.

Author information

Corresponding author

Correspondence to Shunsuke Kanda.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Implementation details

Although our experimental code is publicly available at https://github.com/kampersanda/dyft, we describe some of the implementation details for the interested reader.

Management of database. For all the methods, a database of sketches \(X = \{x_1, x_2, \dots, x_n\}\) has to be stored so that the Hamming distance \(H(x_i, y)\) to a query sketch \(y\) can be computed quickly during final verification. We describe the physical representations of binary and integer sketches, which were applied to all the methods in our experiments. Since the sketch length \(m\) was set to 32 or 64 in our experiments, we assume \(m = O(w)\) for a word size \(w\).

A binary sketch was stored directly in binary format, since the Hamming distance can be computed in \(O(1)\) time by exploiting the bit-parallelism offered by CPUs: \(H(x,y) = \textsf{Popcnt}(x \oplus y)\), where \(\oplus\) is the bitwise-XOR operation and \(\textsf{Popcnt}(\cdot)\) counts the number of 1s. Popcnt is known as the population count operation and is supported in modern CPU instruction sets. We used the built-in GCC function __builtin_popcount for this.
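
As an illustration only, the XOR-and-popcount computation can be sketched in Python, where arbitrary-precision integers stand in for machine words. The function names `popcount` and `hamming_binary` are hypothetical and are not taken from the authors' code.

```python
def popcount(v: int) -> int:
    """Count the number of 1 bits in a non-negative integer."""
    return bin(v).count("1")


def hamming_binary(x: int, y: int) -> int:
    """Hamming distance between two binary sketches packed into integers."""
    # XOR leaves a 1 exactly at the positions where x and y differ.
    return popcount(x ^ y)


x = 0b10110010
y = 0b10011010
print(hamming_binary(x, y))  # 2
```

In a compiled implementation the same two operations map to a single XOR and a single population-count instruction per word.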

For integer sketches, we employed a generalization of the computation approach for binary sketches, which was proposed by Zhang et al. [42]. This approach encodes \(x\) into \(\hat{x}\) in a vertical format, i.e., the \(i\)-th significant bit of each of the \(m\) characters of \(x\) is stored in \(\hat{x}[i]\), a bitmap of \(m\) consecutive bits. Given sketches \(\hat{x}\) and \(\hat{y}\) in the vertical format, we can compute \(H(x,y)\) as follows. We initialize a bitmap \(t\) of \(m\) bits in which all the bits are set to zero. For each \(i = 0,1,\ldots ,\lceil \log_2 \sigma \rceil - 1\), where each character takes one of \(\sigma\) values, we iteratively perform \(t \leftarrow t \vee (\hat{x}[i] \oplus \hat{y}[i])\), where \(\vee\) is the bitwise-OR operation. \(\textsf{Popcnt}(t)\) for the resulting \(t\) equals \(H(x,y)\). The computation time is \(O(\log \sigma)\). Thus, we stored integer sketches in the vertical format and employed this computation approach.
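
The vertical-format computation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `to_vertical`, `hamming_vertical`, and `bits_per_char` are hypothetical, Python integers stand in for \(m\)-bit bitmaps, and bit planes are ordered from the least significant bit upward (the order does not affect the resulting distance).

```python
def popcount(v: int) -> int:
    """Count the number of 1 bits in a non-negative integer."""
    return bin(v).count("1")


def to_vertical(sketch, bits_per_char):
    """Encode an integer sketch into bit planes.

    planes[i] has bit j set iff the i-th bit of sketch[j] is 1.
    bits_per_char plays the role of ceil(log2(sigma)).
    """
    planes = [0] * bits_per_char
    for j, c in enumerate(sketch):
        for i in range(bits_per_char):
            if (c >> i) & 1:
                planes[i] |= 1 << j
    return planes


def hamming_vertical(xv, yv):
    """Hamming distance between two sketches given in vertical format."""
    t = 0
    for xi, yi in zip(xv, yv):
        # A position differs if it differs in at least one bit plane.
        t |= xi ^ yi
    return popcount(t)


x = [3, 0, 2, 1]  # sigma = 4, so two bits per character
y = [3, 1, 2, 2]
print(hamming_vertical(to_vertical(x, 2), to_vertical(y, 2)))  # 2
```

The loop in `hamming_vertical` runs once per bit plane, which matches the \(O(\log \sigma)\) bound stated above.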

Assignment of smaller Hamming radii. DyFT\(^+\) employs the multi-index approach [17] to extend a DyFT search to large Hamming radii. Traditionally, this approach partitions sketches into \(q\) blocks and assigns radius \(\lfloor r/q \rfloor\) to each block, which guarantees no false negatives by the (basic) pigeonhole principle. Recently, Qin et al. [34] developed the general pigeonhole principle, which allows the multi-index approach to assign smaller radii than \(\lfloor r/q \rfloor\): the multi-index search produces no false negatives as long as the radii assigned to the blocks sum to \(r - q + 1\). In DyFT\(^+\) and MIH, we assigned radii to the blocks in a round-robin manner until the total reached \(r - q + 1\).
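
A minimal sketch of such a round-robin assignment is given below. The function name `assign_radii` is hypothetical, the sketch assumes \(r \ge q - 1\) so that the target total is non-negative, and the choice of which blocks receive the extra unit (the first ones) is an assumption rather than a detail confirmed by the text.

```python
def assign_radii(r: int, q: int) -> list:
    """Distribute a total radius of r - q + 1 over q blocks, one unit at a time."""
    total = r - q + 1
    if q <= 0 or total < 0:
        raise ValueError("this sketch assumes q >= 1 and r >= q - 1")
    radii = [0] * q
    for k in range(total):
        # Visit blocks cyclically so the radii differ by at most one.
        radii[k % q] += 1
    return radii


print(assign_radii(4, 2))   # [2, 1]
print(assign_radii(10, 4))  # [2, 2, 2, 1]
```

For \(r = 4\) and \(q = 2\), the basic principle would assign radius 2 to both blocks, whereas the assignment above uses radii 2 and 1. No false negatives arise: if every block \(j\) had distance at least \(r_j + 1\), the total distance would be at least \((r - q + 1) + q = r + 1 > r\).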

Implementation of DyFT pointers. Although our method can significantly reduce the number of DyFT nodes, the representation of DyFT nodes is still the main memory consumer. Pointers on a 64-bit machine can be very expensive for representing DyFT nodes. Even for the largest dataset CP216M, the number of DyFT nodes was 217 million and less than \(2^{27}\) when \(\sigma = 16\) and \(r=1\). Hence, we reserved a consecutive memory block to store DyFT nodes and implemented a DyFT pointer as a 32-bit integer offset within the memory block. The memory block was expanded by doubling its size. This modification is fair to the competitors, since the implementations of MIH and HWT also assume that the database size \(n\) can be represented by a 32-bit integer.
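
The idea of replacing 64-bit pointers with 32-bit indices into a growable block can be illustrated with the following sketch. It is not the DyFT node layout: the class `NodePool`, the sentinel `NIL`, and the single `parent` field per node are hypothetical, and the code assumes that the `array` typecode `"I"` is 4 bytes wide, as it is on mainstream platforms.

```python
from array import array

NIL = 0xFFFFFFFF  # hypothetical sentinel meaning "no node"


class NodePool:
    """Nodes live in one contiguous block and are addressed by 32-bit indices."""

    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.size = 0
        # One 32-bit slot per node; here a node stores only its parent's index.
        self.parent = array("I", [NIL]) * capacity

    def allocate(self, parent: int = NIL) -> int:
        """Reserve a new node and return its index, doubling the block if full."""
        if self.size >= NIL:
            raise OverflowError("node index no longer fits in 32 bits")
        if self.size == self.capacity:
            self.parent.extend(array("I", [NIL]) * self.capacity)
            self.capacity *= 2
        idx = self.size
        self.parent[idx] = parent
        self.size += 1
        return idx


pool = NodePool(capacity=2)
root = pool.allocate()
child = pool.allocate(root)
other = pool.allocate(root)  # triggers doubling from 2 to 4 slots
print(pool.capacity, pool.parent[other] == root)  # 4 True
```

Each reference costs 4 bytes instead of 8, and doubling keeps the amortized cost of an allocation constant.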

Cite this article

Kanda, S., Tabei, Y. DyFT: a dynamic similarity search method on integer sketches. Knowl Inf Syst 63, 2815–2840 (2021). https://doi.org/10.1007/s10115-021-01611-2
