DyFT: a dynamic similarity search method on integer sketches

Kanda, Shunsuke; Tabei, Yasuo

doi:10.1007/s10115-021-01611-2


Regular paper
Published: 09 September 2021

Volume 63, pages 2815–2840, (2021)
Knowledge and Information Systems

Abstract

Similarity-preserving hashing is a core technique for fast similarity searches. It randomly maps data points in a metric space to strings of discrete symbols (i.e., sketches) in the Hamming space. While traditional hashing techniques produce binary sketches, recent ones produce integer sketches to preserve various similarity measures. However, most similarity search methods are designed for binary sketches and are inefficient for integer sketches. Moreover, most methods are either inapplicable to or inefficient for dynamic datasets, even though modern real-world datasets are updated over time. We propose the dynamic filter trie (DyFT), a dynamic similarity search method for both binary and integer sketches. An extensive experimental analysis using large real-world datasets shows that DyFT is superior in scalability, time performance, and memory efficiency. For example, on a huge dataset of 216 million data points, DyFT performs a similarity search 6000 times faster than a state-of-the-art method while using one-thirteenth of the memory.


References

Arslan AN, Eǧecioǧlu Ö (2002) Dictionary look-up within small edit distance. In: Proceedings of the 8th international computing and combinatorics conference (COCOON). pp 127–136
Askitis N, Sinha R (2010) Engineering scalable, cache and space efficient tries for strings. VLDB J 19(5):633–660
Batko M, Falchi F, Lucchese C, Novak D, Perego R, Rabitti F, Sedmidubsky J, Zezula P (2010) Building a web-scale image similarity search system. Multimedia Tools Appl 47(3):599–629
Belazzougui D, Venturini R (2012) Compressed string dictionary look-up with edit distance one. In: Proceedings of the 23rd annual symposium on combinatorial pattern matching (CPM). pp 280–292
Binna R, Zangerle E, Pichl M, Specht G, Leis V (2018) HOT: a height optimized trie index for main-memory database systems. In: Proceedings of the 2018 ACM SIGMOD international conference on management of data. pp 521–534
Boehm M, Schlegel B, Volk PB, Fischer U, Habich D, Lehner W (2011) Efficient in-memory indexing with generalized prefix trees. In: Proceedings of the 14th BTW conference on database systems for business, technology, and web. pp 227–246
Boytsov L (2011) Indexing methods for approximate dictionary searching: comparative analysis. J Exp Algorithm (JEA) 16:1
Cao Y, Qi H, Zhou W, Kato J, Li K, Liu X, Gui J (2018) Binary hashing for approximate nearest neighbor search on big data: a survey. IEEE Access 6:2039–2054
Chan H-L, Lam T-W, Sung W-K, Tam S-L, Wong S-S (2010) Compressed indexes for approximate string matching. Algorithmica 58(2):263–281
Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the 34th annual ACM symposium on theory of computing (STOC). pp 380–388
Chuang J-C, Cho C-W, Chen ALP (2006) Similarity search in transaction databases with a two-level bounding mechanism. In: Proceedings of the international conference on database systems for advanced applications (DASFAA). pp 572–586
Cole R, Gottlieb L-A, Lewenstein M (2004) Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the 36th annual ACM symposium on theory of computing (STOC). pp 91–100
Driemel A, Silvestri F (2017) Locality-sensitive hashing of curves. In: Proceedings of the 33rd international symposium on computational geometry (SoCG)
Eghbali S, Ashtiani H, Tahvildari L (2020) Online nearest neighbor search using Hamming weight trees. IEEE Trans Pattern Anal Mach Intell 42(7):1729–1740
Fredkin E (1960) Trie memory. Commun ACM 3(9):490–499
Gog S, Venturini R (2016) Fast and compact Hamming distance index. In: Proceedings of the 39th international ACM SIGIR conference on research and development in information retrieval. pp 285–294
Greene D, Parnas M, Yao F (1994) Multi-index hashing for information retrieval. In: Proceedings of the 35th annual symposium on foundations of computer science (FOCS). pp 722–731
Heinz S, Zobel J, Williams HE (2002) Burst tries: a fast, efficient data structure for string keys. ACM Trans Inf Syst 20(2):192–223
Henzinger M (2006) Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval. pp 284–291
Ito J-I, Tabei Y, Shimizu K, Tsuda K, Tomii K (2012) PoSSuM: a database of similar protein-ligand binding and putative pockets. Nucleic Acids Res 40:D541–D548
Kanda S, Tabei Y (2019) b-bit sketch trie: scalable similarity search on integer sketches. In: Proceedings of the 2019 IEEE international conference on big data. pp 810–819
Kanda S, Tabei Y (2020) Dynamic similarity search on integer sketches. In: Proceedings of the 20th IEEE international conference on data mining (ICDM). pp 242–251
Kanda S, Takeuchi K, Fujii K, Tabei Y (2020) Succinct trit-array trie for scalable trajectory similarity search. In: Proceedings of the 28th ACM SIGSPATIAL international conference on advances in geographic information systems (SIGSPATIAL). pp 518–529
Kuhn M, Szklarczyk D, Franceschini A, Campillos M, von Mering C, Jensen LJ, Beyer A, Bork P (2009) STITCH 2: an interaction network database for small molecules and proteins. Nucleic Acids Res 38(suppl-1):D552–D556
Leis V, Kemper A, Neumann T (2013) The adaptive radix tree: ARTful indexing for main-memory databases. In: Proceedings of the IEEE 29th international conference on data engineering (ICDE). pp 38–49
Li P (2015) 0-bit consistent weighted sampling. In: Proceedings of the 21th ACM SIGKDD International conference on knowledge discovery and data mining. pp 665–674
Li P (2017) Linearized GMM kernels and normalized random Fourier features. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. pp 315–324
Li P, König C (2010) b-Bit minwise hashing. In: Proceedings of the 19th international conference on World Wide Web (WWW). pp 671–680
Li P, Lu H, Zheng Q, Yang L, Pan G (2020) LISA: a learned index structure for spatial data. In: Proceedings of the 2020 ACM SIGMOD international conference on management of data. pp 2119–2133
Loosli G, Canu S, Bottou L (2007) Training invariant support vector machines using selective sampling. In: Large scale kernel machines. pp 301–320
McAuley J, Leskovec J (2013) Hidden factors and hidden topics: understanding rating dimensions with review text. In: Proceedings of the 7th ACM conference on recommender systems (RecSys). pp 165–172
Mikolov T, Grave E, Bojanowski P, Puhrsch C, Joulin A (2018) Advances in pre-training distributed word representations. In: Proceedings of the international conference on language resources and evaluation (LREC)
Norouzi M, Punjani A, Fleet DJ (2014) Fast exact search in Hamming space with multi-index hashing. IEEE Trans Pattern Anal Mach Intell 36(6):1107–1119
Qin J, Xiao C, Wang Y, Wang W (2021) Generalizing the pigeonhole principle for similarity search in Hamming space. IEEE Trans Knowl Data Eng 33(2):489–505
Song J, Yang Y, Yang Y, Huang Z, Shen HT (2013) Inter-media hashing for large-scale retrieval from heterogeneous data sources. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data. pp 785–796
Sundaram N, Turmukhametova A, Satish N, Mostak T, Indyk P, Madden S, Dubey P (2013) Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proc VLDB Endow 6(14):1930–1941
Tabei Y, Tsuda K (2011) Sketchsort: fast all pairs similarity search for large databases of molecular fingerprints. Mol Inform 30(9):801–807
Weng Z, Zhu Y (2019) Online supervised sketching hashing for large-scale image retrieval. IEEE Access 7:88369–88379
Yang R, Niu B (2020) Continuous K nearest neighbor queries over large-scale spatial-textual data streams. ISPRS Int J Geo-Inf 9(11):694
Yoshinaga N, Kitsuregawa M (2014) A self-adaptive classifier for efficient text-stream processing. In: Proceedings of the 24th international conference on computational linguistics (COLING). pp 1091–1102
Zhang H, Lim H, Leis V, Andersen DG, Kaminsky M, Keeton K, Pavlo A (2018) SuRF: practical range query filtering with fast succinct tries. In: Proceedings of the 2018 ACM SIGMOD international conference on management of data. pp 323–336
Zhang X, Qin J, Wang W, Sun Y, Lu J (2013) HmSearch: an efficient Hamming distance query processing algorithm. In: Proceedings of the 25th international conference on scientific and statistical database management (SSDBM). p 19


Acknowledgements

This work was supported by JST AIP-PRISM (Grant Number JPMJCR18Y5). We thank the anonymous reviewers for their helpful comments.

Author information

Authors and Affiliations

RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
Shunsuke Kanda & Yasuo Tabei

Authors

Shunsuke Kanda
Yasuo Tabei

Corresponding author

Correspondence to Shunsuke Kanda.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Implementation details

Although our experimental code is publicly available at https://github.com/kampersanda/dyft, we describe some of the implementation details for the interested reader.

Management of database. For all the methods, the database of sketches \(X = \{x_1, x_2, \dots, x_n\}\) has to be stored so that the Hamming distance \(H(x_i, y)\) to a query sketch y can be computed quickly during final verification. We describe the physical representations of binary and integer sketches, which were applied to all the methods in our experiments. Since the sketch length m was set to 32 or 64 in our experiments, we assume \(m = O(w)\) for a word size w.

A binary sketch was stored directly in a binary format, since the Hamming distance can then be computed in O(1) time by exploiting the bit-parallelism offered by CPUs: \(H(x,y) = \textsf{Popcnt}(x \oplus y)\), where \(\oplus\) is a bitwise-XOR operation and \(\textsf{Popcnt}(\cdot)\) counts the number of 1s. Popcnt is known as the population count operation and is supported in modern CPU instruction sets. We used the built-in GCC function __builtin_popcount for this.
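The XOR-and-popcount identity above can be illustrated with a minimal Python sketch. It assumes a binary sketch is packed into a single non-negative integer; the function name `hamming_binary` is hypothetical and is not taken from the authors' code, which relies on the GCC built-in instead.

```python
def hamming_binary(x: int, y: int) -> int:
    """Hamming distance between two binary sketches packed into integers.

    Illustrative only: `hamming_binary` is a hypothetical name, and the
    sketches are assumed to be non-negative integers of equal bit length.
    """
    # XOR leaves a 1 exactly at the positions where the two sketches differ.
    diff = x ^ y
    # Counting the 1s in the difference is the population count (Popcnt).
    return bin(diff).count("1")
```

For instance, `hamming_binary(0b1011, 0b0010)` compares two 4-bit sketches that differ in the lowest and highest bit positions.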

For integer sketches, we employed a generalization of the computation approach for binary sketches, which was proposed by Zhang et al. [42]. This approach encodes x into \(\hat{x}\) in a vertical format, i.e., the i-th significant bits of the m characters of x are stored in \(\hat{x}[i]\) as m consecutive bits. Given sketches \(\hat{x}\) and \(\hat{y}\) in the vertical format, we can compute H(x, y) as follows. We initialize a bitmap t of m bits in which all the bits are set to zero. For each \(i = 0,1,\ldots,\lceil \log_2 \sigma \rceil - 1\), we iteratively perform \(t \leftarrow t \vee (\hat{x}[i] \oplus \hat{y}[i])\), where \(\vee\) is a bitwise-OR operation. \(\textsf{Popcnt}(t)\) for the resulting t equals H(x, y), because the j-th bit of t is set exactly when the j-th characters of x and y differ in at least one bit. The computation time is \(O(\log \sigma)\). Thus, we stored integer sketches in the vertical format and employed this computation approach.
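The vertical format and the OR-of-XORs computation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a sketch is given as a list of m integers in [0, σ), bit planes are indexed from the least significant bit (the result does not depend on the order), and the names `to_vertical` and `hamming_vertical` are hypothetical.

```python
def to_vertical(x, sigma):
    """Encode an integer sketch into bit planes (vertical format).

    x     : list of m characters, each an integer in [0, sigma)
    sigma : alphabet size
    Returns a list of ceil(log2(sigma)) integers; bit j of plane i holds
    bit i of character x[j]. Names and layout are illustrative assumptions.
    """
    # (sigma - 1).bit_length() equals ceil(log2(sigma)) for sigma >= 2.
    num_planes = max(1, (sigma - 1).bit_length())
    planes = [0] * num_planes
    for j, c in enumerate(x):
        for i in range(num_planes):
            planes[i] |= ((c >> i) & 1) << j
    return planes


def hamming_vertical(xh, yh):
    """Hamming distance between two sketches given in the vertical format."""
    t = 0  # bitmap of m bits, initially all zero
    for px, py in zip(xh, yh):
        # Bit j becomes 1 as soon as characters j differ in any bit plane.
        t |= px ^ py
    return bin(t).count("1")
```

A usage example: with `x = [3, 0, 15, 7]`, `y = [3, 1, 15, 8]`, and `sigma = 16`, calling `hamming_vertical(to_vertical(x, 16), to_vertical(y, 16))` counts the two positions at which the characters differ, using four XOR/OR steps regardless of m.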

Assignment of smaller Hamming radii. DyFT\(^+\) employs the multi-index approach [17] to leverage a DyFT search for a large Hamming radius. Traditionally, this approach assigns radius \(\lfloor r/q \rfloor\) to each of the q blocks, which avoids false negatives by the (basic) pigeonhole principle. Recently, Qin et al. [34] developed the general pigeonhole principle, which allows the multi-index approach to assign radii smaller than \(\lfloor r/q \rfloor\). It ensures that the multi-index search does not produce false negatives when the radii assigned to the blocks sum to \(r - q + 1\). In DyFT\(^+\) and MIH, we assigned radii to the blocks in a round-robin manner until the total reached \(r - q + 1\).
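The round-robin assignment can be sketched as below. This is a simplified illustration, not the authors' code: the function name `assign_radii` is hypothetical, every block starts at radius 0, and the sketch assumes \(r \ge q - 1\) so that the target total \(r - q + 1\) is non-negative.

```python
def assign_radii(r, q):
    """Distribute a total radius of r - q + 1 over q blocks, round-robin.

    r : Hamming radius of the whole query
    q : number of blocks
    Assumes r >= q - 1 (a simplification made for this sketch).
    """
    total = r - q + 1
    if q <= 0 or total < 0:
        raise ValueError("this sketch requires q >= 1 and r >= q - 1")
    radii = [0] * q
    # Hand out one unit of radius at a time, cycling over the blocks,
    # until the radii sum to r - q + 1.
    for k in range(total):
        radii[k % q] += 1
    return radii
```

For example, with r = 4 and q = 3 the basic principle would assign \(\lfloor 4/3 \rfloor = 1\) to every block, whereas `assign_radii(4, 3)` returns `[1, 1, 0]`. No false negative arises: a sketch missed by all three blocks must have block distances of at least 2, 2, and 1, which sum to 5 > r.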

Implementation of DyFT pointers. Although our method can significantly reduce the number of DyFT nodes, the representation of DyFT nodes is still the main memory-consuming part. Pointers on a 64-bit machine can be very expensive for representing DyFT nodes. Even for the largest dataset CP216M, the number of DyFT nodes was 217 million and less than \(2^{27}\) when \(\sigma = 16\) and \(r=1\). Hence, we reserved a contiguous memory block to store DyFT nodes and implemented a DyFT pointer as a 32-bit integer index into the memory block. The memory block was expanded by doubling its size. This modification is fair to the competitors since the implementations of MIH and HWT also assume that the database size n can be represented in a 32-bit integer.
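The idea of replacing machine pointers with 32-bit indices into one growable block can be sketched as follows. This is only an illustration of the general technique: the class `NodePool`, its fixed fan-out layout, and the use of index 0 (the root) as the "no child" marker are assumptions made for this sketch and are not taken from the DyFT implementation.

```python
from array import array


class NodePool:
    """Nodes live in one contiguous array; a node 'pointer' is an index.

    Hypothetical layout: each node owns `fanout` child slots. A slot value
    of 0 means "no child", which works because the root (index 0) is never
    the child of another node.
    """

    MAX_NODES = 2**32  # indices must fit in 32 bits

    def __init__(self, fanout, capacity=4):
        self.fanout = fanout
        self.capacity = capacity  # number of nodes the block can hold
        self.size = 0             # number of nodes allocated so far
        # Typecode "I" is an unsigned int (4 bytes on common platforms).
        self.slots = array("I", [0]) * (capacity * fanout)

    def alloc(self):
        """Return the index of a fresh node, doubling the block when full."""
        if self.size >= self.MAX_NODES:
            raise OverflowError("node index no longer fits in 32 bits")
        if self.size == self.capacity:
            self.slots.extend(array("I", [0]) * (self.capacity * self.fanout))
            self.capacity *= 2
        node = self.size
        self.size += 1
        return node

    def set_child(self, node, symbol, child):
        self.slots[node * self.fanout + symbol] = child

    def get_child(self, node, symbol):
        return self.slots[node * self.fanout + symbol]
```

With this layout each child reference costs one array element instead of an 8-byte pointer, and growing the block never invalidates existing references because they are indices rather than addresses.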


About this article

Cite this article

Kanda, S., Tabei, Y. DyFT: a dynamic similarity search method on integer sketches. Knowl Inf Syst 63, 2815–2840 (2021). https://doi.org/10.1007/s10115-021-01611-2


Received: 28 January 2021
Revised: 25 August 2021
Accepted: 28 August 2021
Published: 09 September 2021
Issue Date: November 2021
DOI: https://doi.org/10.1007/s10115-021-01611-2
