DyFT: a dynamic similarity search method on integer sketches, Knowledge and Information Systems



Knowledge and Information Systems (IF 2.5), Pub Date: 2021-09-09, DOI: 10.1007/s10115-021-01611-2
Shunsuke Kanda, Yasuo Tabei

Similarity-preserving hashing is a core technique for fast similarity search: it randomly maps data points in a metric space to strings of discrete symbols (i.e., sketches) in the Hamming space. While traditional hashing techniques produce binary sketches, recent ones produce integer sketches to preserve various similarity measures. However, most similarity search methods are designed for binary sketches and are inefficient for integer sketches. Moreover, most methods are either inapplicable or inefficient for dynamic datasets, even though modern real-world datasets are updated over time. We propose the dynamic filter trie (DyFT), a dynamic similarity search method for both binary and integer sketches. An extensive experimental analysis using large real-world datasets shows that DyFT is superior in scalability, time performance, and memory efficiency. For example, on a huge dataset of 216 million data points, DyFT performs a similarity search 6000 times faster than a state-of-the-art method while using one-thirteenth of the memory.
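
To make the problem setting concrete, the following is a minimal Python sketch of dynamic similarity search over integer sketches under Hamming distance. It is not the DyFT data structure described in the paper; it is a plain trie with mismatch-count pruning, and the names `SketchTrie`, `insert`, and `search` are hypothetical, chosen only for illustration. It assumes all sketches have the same fixed length.

```python
from typing import Dict, List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length sketches differ."""
    return sum(x != y for x, y in zip(a, b))


class _Node:
    """Trie node: children keyed by sketch symbol, ids stored at leaves."""
    __slots__ = ("children", "ids")

    def __init__(self) -> None:
        self.children: Dict[int, "_Node"] = {}
        self.ids: List[int] = []


class SketchTrie:
    """Illustrative trie over fixed-length integer sketches (not DyFT itself)."""

    def __init__(self, length: int) -> None:
        self.length = length
        self.root = _Node()

    def insert(self, sketch: Sequence[int], item_id: int) -> None:
        """Add one sketch; the index can grow at any time (dynamic setting)."""
        if len(sketch) != self.length:
            raise ValueError("sketch has the wrong length")
        node = self.root
        for symbol in sketch:
            node = node.children.setdefault(symbol, _Node())
        node.ids.append(item_id)

    def search(self, query: Sequence[int], radius: int) -> List[int]:
        """Return ids of all sketches within Hamming distance `radius` of `query`."""
        if len(query) != self.length:
            raise ValueError("query has the wrong length")
        results: List[int] = []
        stack = [(self.root, 0, 0)]  # (node, depth, mismatches so far)
        while stack:
            node, depth, errors = stack.pop()
            if depth == self.length:
                results.extend(node.ids)
                continue
            for symbol, child in node.children.items():
                new_errors = errors + (symbol != query[depth])
                if new_errors <= radius:  # prune subtrees that exceed the radius
                    stack.append((child, depth + 1, new_errors))
        return sorted(results)


trie = SketchTrie(length=4)
trie.insert((0, 3, 1, 2), item_id=0)
trie.insert((0, 3, 1, 1), item_id=1)
trie.insert((2, 2, 2, 2), item_id=2)
near = trie.search((0, 3, 1, 2), radius=1)  # ids 0 and 1 are within distance 1
```

Because symbols are compared for equality only, the same code handles binary sketches (symbols 0 and 1) and integer sketches with larger alphabets. A plain trie like this one visits more nodes as the radius and alphabet grow, which is the kind of cost the abstract says existing methods incur on integer sketches.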


Updated: 2021-09-09
