High-dimensional similarity searches using query driven dynamic quantization and distributed indexing
Distributed and Parallel Databases (IF 1.2), Pub Date: 2019-04-11, DOI: 10.1007/s10619-019-07266-x
Gheorghi Guzun, Guadalupe Canahuate

The concept of similarity is the basis for many data exploration and data mining tasks. Nearest neighbor (NN) queries identify the most similar items or, in terms of distance, the points closest to a query point. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to significantly distinguish between the closest and furthest points, because a few dissimilar dimensions dominate the distance function. Localized similarity functions, i.e., functions that only consider dimensions close to the query, quantize each dimension independently and only compute similarity for the dimensions where the query and the points fall into the same bin. These quantizations are query-agnostic, and there is potential to improve accuracy when a query-dependent quantization is used. In this work we propose a query-dependent equi-depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches. The quantization is done for each dimension at query time: localized scores are generated for the closest fraction p of the points, while a constant penalty is applied to the rest of the points. QED not only improves the quality of the distance metric, but also improves query time performance by filtering out non-relevant data. We propose a distributed indexing and query algorithm to efficiently compute QED. Our experimental results show improvements in classification accuracy, as well as query times up to one order of magnitude faster than Manhattan-based sequential scan NN queries over datasets with hundreds of dimensions. Furthermore, similarity searches with QED show linear or better scalability in relation to the number of dimensions and the number of compute nodes.
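
The per-dimension scoring described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the function names `qed_scores` and `qed_knn` are hypothetical, the localized score is assumed to be the absolute difference in that dimension, and the constant penalty is assumed to equal the radius of the query's bin (the distance to the last point kept in that dimension). The abstract does not specify these details, and the distributed index is not modeled here.

```python
import math
from typing import List, Sequence


def qed_scores(points: Sequence[Sequence[float]],
               query: Sequence[float],
               p: float) -> List[float]:
    """Return one dissimilarity score per point (lower means more similar)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    n = len(points)
    # Number of points kept inside the query-centered bin in each dimension.
    keep = max(1, math.ceil(p * n))
    scores = [0.0] * n
    for d, q_d in enumerate(query):
        dists = [abs(pt[d] - q_d) for pt in points]
        # Query-dependent bin: its radius is the distance to the keep-th
        # closest point in this dimension, so it holds about p * n points.
        radius = sorted(dists)[keep - 1]
        for i, dist in enumerate(dists):
            # Localized score inside the bin, constant penalty outside it.
            # Assumption: the penalty equals the bin radius.
            scores[i] += min(dist, radius)
    return scores


def qed_knn(points: Sequence[Sequence[float]],
            query: Sequence[float],
            k: int,
            p: float) -> List[int]:
    """Return the indices of the k points with the lowest scores."""
    scores = qed_scores(points, query, p)
    return sorted(range(len(points)), key=lambda i: scores[i])[:k]


points = [[0.0, 0.0], [1.0, 1.0], [2.0, 100.0], [10.0, 10.0]]
print(qed_scores(points, [0.0, 0.0], p=0.75))
print(qed_knn(points, [0.0, 0.0], k=2, p=0.75))
```

Under these assumptions the score is a Manhattan distance truncated per dimension. The point `[2.0, 100.0]` has a Manhattan distance of 102 from the origin, but its second coordinate contributes only the capped penalty, so a single very dissimilar dimension no longer dominates the total.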

Updated: 2019-04-11