A Transformation-based Framework for KNN Set Similarity Search
IEEE Transactions on Knowledge and Data Engineering (IF 8.9), Pub Date: 2020-03-01, DOI: 10.1109/tkde.2018.2886189
Yong Zhang, Jiacheng Wu, Jin Wang, Chunxiao Xing

Set similarity search is a fundamental operation in a variety of applications. While many previous studies focus on threshold-based set similarity search and join, little effort has been devoted to KNN set similarity search. In this paper, we propose a transformation-based framework to solve the problem of KNN set similarity search, which, given a collection of set records and a query set, returns the $k$ records with the largest similarity to the query. We devise an effective transformation mechanism that converts sets of varying lengths into fixed-length vectors such that similar sets are mapped closer to each other. Then, we index these vectors with a tiny tree structure. Next, we propose efficient search algorithms and pruning strategies to perform exact KNN set similarity search. We also design an estimation technique that leverages the data distribution to support approximate KNN search, which speeds up the search while retaining high recall. Experimental results on real-world datasets show that our framework significantly outperforms state-of-the-art methods in both memory-based and disk-based settings.
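
To make the problem statement concrete, the following is a minimal sketch of exact KNN set similarity search with a simple filter-and-verify pruning step. It is not the paper's algorithm: the abstract does not name the similarity function, the transformation, or the index, so Jaccard similarity, the `token % m` grouping in `to_vector`, the linear scan in place of a tree index, and all function names here are assumptions made for illustration only.

```python
import heapq
from typing import List, Sequence, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed metric); empty vs. empty is 1.0."""
    if not a and not b:
        return 1.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def to_vector(s: Set[int], m: int) -> List[int]:
    """Map a set of integer tokens to m counts.

    Hypothetical transformation: tokens are grouped by token % m and each
    vector entry counts how many tokens of the set fall in that group.
    """
    vec = [0] * m
    for tok in s:
        vec[tok % m] += 1
    return vec


def jaccard_upper_bound(u: Sequence[int], v: Sequence[int]) -> float:
    """Upper bound on Jaccard computed from the fixed-length vectors alone.

    Two sets share at most min(u[i], v[i]) tokens inside group i, so the sum
    of those minima bounds the overlap; Jaccard grows with the overlap.
    """
    overlap_ub = sum(min(x, y) for x, y in zip(u, v))
    total = sum(u) + sum(v)
    if total == 0:
        return 1.0
    return overlap_ub / (total - overlap_ub)


def knn_search(records: Sequence[Set[int]], query: Set[int],
               k: int, m: int = 4) -> List[Tuple[float, int]]:
    """Return up to k (similarity, record index) pairs, most similar first."""
    if k <= 0:
        return []
    q_vec = to_vector(query, m)
    # In a real system these vectors would be precomputed and indexed.
    vectors = [to_vector(r, m) for r in records]
    heap: List[Tuple[float, int]] = []  # min-heap holding the current top k
    for idx, (rec, vec) in enumerate(zip(records, vectors)):
        # Prune: the record cannot beat the current k-th best similarity.
        if len(heap) == k and jaccard_upper_bound(q_vec, vec) <= heap[0][0]:
            continue
        sim = jaccard(query, rec)  # verify with the exact similarity
        if len(heap) < k:
            heapq.heappush(heap, (sim, idx))
        elif sim > heap[0][0]:
            heapq.heapreplace(heap, (sim, idx))
    return sorted(heap, key=lambda p: (-p[0], p[1]))
```

Because the vector-based bound never underestimates the true similarity, skipping a record whose bound does not exceed the current $k$-th best score cannot change the similarity values in the result, so the search stays exact under these assumptions. When several records tie at the $k$-th score, which of them is returned depends on scan order.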

Updated: 2020-03-01