Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees
Theory of Computing Systems (IF 0.6), Pub Date: 2020-10-26, DOI: 10.1007/s00224-020-10009-6
Yuliang Li, Jianguo Wang, Benjamin Pullman, Nuno Bandeira, Yannis Papakonstantinou

Given a database of vectors, a cosine threshold query returns all vectors in the database having cosine similarity to a query vector above a given threshold θ. These queries arise naturally in many applications, such as document retrieval, image search, and mass spectrometry. The paper considers the efficient evaluation of such queries, as well as of the closely related top-k cosine similarity queries. It provides algorithms with novel optimality guarantees that also exhibit good performance on real datasets. We take as a starting point Fagin’s well-known Threshold Algorithm (TA), which can be used to answer cosine threshold queries as follows: an inverted index is first built from the database vectors during pre-processing; at query time, the algorithm traverses the index partially to gather a set of candidate vectors to be later verified for θ-similarity. However, directly applying TA in its raw form misses significant optimization opportunities. Indeed, we first show that one can take advantage of the fact that the vectors can be assumed to be normalized, to obtain an improved, tight stopping condition for index traversal and to compute it efficiently and incrementally. Then we show that multiple real-world datasets from mass spectrometry, natural language processing, and computer vision exhibit a certain form of data skewness, and we exploit this property to obtain better traversal strategies. We show that under the skewness assumption, the new traversal strategy has a strong, near-optimal performance guarantee. The techniques developed in the paper are quite general since they can be applied to a large class of similarity functions beyond cosine.
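
To make the baseline concrete, the following is a minimal Python sketch of a TA-style cosine threshold query as the abstract outlines it: build an inverted index, traverse it partially to gather candidates, then verify them. It is not the paper's algorithm. It uses the plain TA stopping bound (the sum of query weights times the current frontier values), not the tight stopping condition, and a simple round-robin traversal, not the skew-aware strategy. The function names (`normalize`, `build_index`, `cosine_threshold_query`), the assumption of non-negative unit vectors, the restriction 0 < θ ≤ 1, and the reading of "above θ" as "at least θ" are all choices made for this illustration.

```python
import math
from collections import defaultdict


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_index(vectors):
    """Inverted index: one list per dimension of (value, vector_id),
    sorted by value in descending order. Zero entries are skipped."""
    index = defaultdict(list)
    for vid, vec in enumerate(vectors):
        for dim, val in enumerate(vec):
            if val > 0:
                index[dim].append((val, vid))
    for postings in index.values():
        postings.sort(reverse=True)
    return index


def cosine_threshold_query(vectors, index, q, theta):
    """Return ids of vectors whose cosine similarity to q is >= theta.

    Assumes (for this sketch only) that all vectors and q are
    non-negative and unit-length, and that 0 < theta <= 1.
    """
    assert 0 < theta <= 1
    dims = [d for d, x in enumerate(q) if x > 0]
    pos = {d: 0 for d in dims}  # next unread position in each list
    candidates = set()

    def frontier(d):
        # Largest value any not-yet-seen vector can have in dimension d.
        postings = index.get(d, [])
        return postings[pos[d]][0] if pos[d] < len(postings) else 0.0

    # Baseline TA bound: an unseen vector s satisfies s[d] <= frontier(d),
    # so its score is at most sum(q[d] * frontier(d)). Stop once that
    # upper bound drops below theta.
    while sum(q[d] * frontier(d) for d in dims) >= theta:
        for d in dims:  # round-robin: advance every list by one entry
            postings = index.get(d, [])
            if pos[d] < len(postings):
                candidates.add(postings[pos[d]][1])
                pos[d] += 1

    # Verification: compute the exact dot product for each candidate.
    return [
        vid
        for vid in sorted(candidates)
        if sum(a * b for a, b in zip(vectors[vid], q)) >= theta
    ]


vectors = [normalize(v) for v in [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0.5, 0.5, 0.7]]]
index = build_index(vectors)
query = normalize([1, 0.1, 0])
print(cosine_threshold_query(vectors, index, query, 0.9))
```

Because the loop only stops once no unseen vector can reach θ, the candidate set is guaranteed to contain every true answer, and the verification step removes the false positives. The baseline bound ignores the unit-length constraint on the database vectors, which is the slack the abstract says a tighter stopping condition can remove.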


Updated: 2020-10-30