Fast Distributed kNN Graph Construction Using Auto-tuned Locality-sensitive Hashing
ACM Transactions on Intelligent Systems and Technology (IF 5), Pub Date: 2020-10-12, DOI: 10.1145/3408889
Carlos Eiras-Franco, David Martínez-Rego, Leslie Kanthan, César Piñeiro, Antonio Bahamonde, Bertha Guijarro-Berdiñas, Amparo Alonso-Betanzos

The k-nearest-neighbors (kNN) graph is a popular and powerful data structure used in various areas of Data Science, but the high computational cost of obtaining it hinders its use on large datasets. Approximate solutions have been described in the literature using diverse techniques, among which Locality-sensitive Hashing (LSH) is a promising alternative that still has unsolved problems. We present Variable Resolution Locality-sensitive Hashing, an algorithm that addresses these problems to obtain an approximate kNN graph at a significantly reduced computational cost. Its usability is greatly enhanced by its capacity to automatically find adequate hyperparameter values, the selection of which is a common hindrance to LSH-based methods. Moreover, we provide an implementation in the distributed computing framework Apache Spark that takes advantage of the structure of the algorithm to efficiently distribute the computational load across multiple machines, enabling practitioners to apply this solution to very large datasets. Experimental results show that our method offers significant improvements over the state of the art in the field and scales very well as more machines are added to the computation.
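
To make the general idea concrete, the following is a minimal single-machine sketch of how plain LSH bucketing can yield an approximate kNN graph: points are hashed into buckets, and exact distances are computed only between points that share a bucket. It is not the Variable Resolution LSH algorithm from the paper, and it includes neither the automatic hyperparameter tuning nor the Apache Spark distribution described above. The hash family (random Gaussian projections quantized into cells of fixed width) is a standard choice for Euclidean distance, and every name and default value here (`make_hash`, `approx_knn_graph`, `n_hashes`, `width`, `n_tables`) is an assumption made for illustration.

```python
import heapq
import math
import random
from collections import defaultdict


def make_hash(dim, n_hashes, width, rng):
    """Build one composite hash g(x) = (h_1(x), ..., h_n(x)), where each
    h(x) = floor((a . x + b) / width), a ~ N(0, I), b ~ U[0, width).
    This is a generic Euclidean LSH family, assumed here for illustration."""
    params = [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
        for _ in range(n_hashes)
    ]

    def g(point):
        return tuple(
            math.floor((sum(a_i * x_i for a_i, x_i in zip(a, point)) + b) / width)
            for a, b in params
        )

    return g


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def approx_knn_graph(points, k, n_hashes=2, width=4.0, n_tables=8, seed=0):
    """Return {index: [up to k neighbor indices, nearest first]}.
    Only points that collide in at least one hash table are compared,
    so the result is approximate and a node may get fewer than k neighbors."""
    rng = random.Random(seed)
    dim = len(points[0])
    candidates = defaultdict(set)

    for _ in range(n_tables):
        g = make_hash(dim, n_hashes, width, rng)
        buckets = defaultdict(list)
        for idx, point in enumerate(points):
            buckets[g(point)].append(idx)
        # Every pair of points inside a bucket becomes a candidate pair.
        for members in buckets.values():
            for i in members:
                candidates[i].update(j for j in members if j != i)

    graph = {}
    for i, point in enumerate(points):
        # Exact ranking, restricted to the candidate set of node i.
        graph[i] = heapq.nsmallest(
            k, candidates[i], key=lambda j: sq_dist(point, points[j])
        )
    return graph
```

In this sketch, `n_hashes` and `width` control how fine the buckets are: finer buckets mean fewer distance computations but more missed neighbors, and `n_tables` trades extra work for better recall. These are the kind of hyperparameters that, per the abstract, are hard to set by hand; here they are simply fixed.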

Updated: 2020-10-12