Information Systems (IF 3.7), Pub Date: 2021-05-21, DOI: 10.1016/j.is.2021.101808. Lucia Vadicamo, Richard Connor, Edgar Chávez
In high-dimensional datasets, exact indexes are ineffective for proximity queries, and a sequential scan over the entire dataset is unavoidable. Accepting this, here we present a new approach employing two-dimensional embeddings. Each database element is mapped to the plane using the four-point property. The caveat is that the mapping is local: in other words, each object is mapped using a different mapping.
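To make the planar mapping concrete, here is a minimal sketch, assuming the usual construction in which two reference objects are placed on the x-axis and each point is positioned from its two distances to them. The function names (`project`, `planar_lower_bound`) and the toy vectors are illustrative assumptions, not taken from the paper. In a space with the four-point property, the distance between two such projections placed on the same side of the x-axis does not exceed the true distance.

```python
import math

def euclid(u, v):
    """Plain Euclidean distance, standing in for any four-point metric."""
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(u, v)))

def project(d1, d2, delta):
    """Place a point in the plane from its distances d1, d2 to two
    reference objects that are delta apart. The references sit at
    (0, 0) and (delta, 0); the point lands on or above the x-axis."""
    x = (d1 * d1 - d2 * d2 + delta * delta) / (2.0 * delta)
    y = math.sqrt(max(d1 * d1 - x * x, 0.0))  # clamp tiny negative rounding noise
    return x, y

def planar_lower_bound(a, b):
    """Distance between two projections made with the same reference pair."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Toy 4-dimensional data (illustrative values only).
p1, p2 = (0.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0)
o = (0.3, 0.8, 0.1, 0.5)
q = (0.6, 0.1, 0.9, 0.2)

delta = euclid(p1, p2)
o_2d = project(euclid(o, p1), euclid(o, p2), delta)
q_2d = project(euclid(q, p1), euclid(q, p2), delta)

lower = planar_lower_bound(o_2d, q_2d)
true = euclid(o, q)
print(f"lower bound {lower:.4f} <= true distance {true:.4f}")
```

Each stored object then needs only two coordinates plus an identifier for its reference pair, which is where the small memory footprint comes from.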
The idea is that each element of the data is associated with a pair of reference objects that is well-suited to filter that particular object, in cases where it is not relevant to a query. This maximises the probability of excluding that object from a search. At query time, a query is compared with a pool of reference objects which allow its mapping to all the planes used by data objects. Then, for each query/object pair, a lower bound of the actual distance is obtained. The technique can be applied to any metric space that possesses the four-point property, therefore including Euclidean, Cosine, Triangular, Jensen–Shannon, and Quadratic Form distances.
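The query-time procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the helper names (`build`, `filter_query`), the Euclidean toy data, and the random assignment of a reference pair to each object (`pair_of`) are all hypothetical. In particular, the random assignment only stands in for whatever strategy selects a well-suited pair per object, which the abstract does not specify.

```python
import math
import random

def euclid(u, v):
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(u, v)))

def project(d1, d2, delta):
    """Planar coordinates from two reference distances (references delta apart)."""
    x = (d1 * d1 - d2 * d2 + delta * delta) / (2.0 * delta)
    return x, math.sqrt(max(d1 * d1 - x * x, 0.0))

def build(data, refs, pairs, pair_of):
    """Per object, store only its pair id and its 2D coordinates."""
    table = []
    for obj, k in zip(data, pair_of):
        i, j = pairs[k]
        delta = euclid(refs[i], refs[j])
        table.append((k, project(euclid(obj, refs[i]), euclid(obj, refs[j]), delta)))
    return table

def filter_query(query, threshold, refs, pairs, table):
    """Return indices of objects whose lower bound does not exclude them."""
    q_dist = [euclid(query, r) for r in refs]  # one distance per reference object
    # Map the query into every plane used by the data objects.
    q_proj = [project(q_dist[i], q_dist[j], euclid(refs[i], refs[j]))
              for i, j in pairs]
    survivors = []
    for idx, (k, (x, y)) in enumerate(table):
        qx, qy = q_proj[k]
        if math.hypot(x - qx, y - qy) <= threshold:  # lower bound test
            survivors.append(idx)
    return survivors

rng = random.Random(0)
dim = 8
data = [[rng.random() for _ in range(dim)] for _ in range(200)]
refs = [[rng.random() for _ in range(dim)] for _ in range(4)]
pairs = [(i, j) for i in range(len(refs)) for j in range(i + 1, len(refs))]
pair_of = [rng.randrange(len(pairs)) for _ in data]  # ASSUMPTION: placeholder choice

table = build(data, refs, pairs, pair_of)
query = [rng.random() for _ in range(dim)]
threshold = 0.8
survivors = filter_query(query, threshold, refs, pairs, table)
true_hits = [i for i, o in enumerate(data) if euclid(query, o) <= threshold]
print(len(survivors), "of", len(data), "objects survive the filter")
```

The query pays for one distance per reference object, after which every data object is tested with two-dimensional arithmetic only. Objects that survive still need an exact distance check, since the filter can only exclude.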
Our experiments show that for all the datasets tested, of varying dimensionality, our approach can filter more objects than a standard metric indexing approach. For low-dimensional data this does not make a good search mechanism in its own right, as it does not scale with the size of the data: that is, its cost is linear with respect to the data size. However, we also show that it can be added as a post-filter to other mechanisms, increasing efficiency with little extra cost in space or time. For high-dimensional data, we show related approximate techniques which, we believe, give the best known compromise for speeding up the essential sequential scan. The potential uses of our filtering technique include pure GPU searching, taking advantage of the tiny memory footprint of the mapping.
Title: Query filtering using two-dimensional local embeddings