Randomized near-neighbor graphs, giant components and applications in data science, Journal of Applied Probability


Randomized near-neighbor graphs, giant components and applications in data science
Journal of Applied Probability (IF 0.7), Pub Date: 2020-07-16, DOI: 10.1017/jpr.2020.21
George C. Linderman, Gal Mishne, Ariel Jaffe, Yuval Kluger, Stefan Steinerberger


If we pick $n$ random points uniformly in $[0,1]^d$ and connect each point to its $c_d \log{n}$ nearest neighbors, where $d\ge 2$ is the dimension and $c_d$ is a constant depending on the dimension, then it is well known that the graph is connected with high probability. We prove that it suffices to connect every point to $c_{d,1} \log{\log{n}}$ points chosen randomly among its $c_{d,2} \log{n}$ nearest neighbors to ensure a giant component of size $n - o(n)$ with high probability. This construction yields a much sparser random graph with $\sim n \log\log{n}$ instead of $\sim n \log{n}$ edges that has comparable connectivity properties. This result has non-trivial implications for problems in data science where an affinity matrix is constructed: instead of connecting each point to its $k$ nearest neighbors, one can often pick $k'\ll k$ random points out of the $k$ nearest neighbors and only connect to those without sacrificing quality of results. This approach can simplify and accelerate computation; we illustrate this with experimental results in spectral clustering of large-scale datasets.
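The construction described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `randomized_near_neighbor_graph` and `largest_component_size` are hypothetical, the constants $c_{d,1}$ and $c_{d,2}$ are both set to 2 purely for illustration (the abstract does not give their values), and the nearest-neighbor search is brute force, so it is only suitable for small $n$.

```python
import math
import random


def randomized_near_neighbor_graph(points, k, k_prime, rng):
    """Connect each point to k_prime neighbors drawn at random from its k nearest."""
    n = len(points)
    edges = set()
    for i, p in enumerate(points):
        # Brute-force k nearest neighbors of point i (excluding i itself).
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        nearest = others[:k]
        for j in rng.sample(nearest, k_prime):
            edges.add((min(i, j), max(i, j)))  # store as an undirected edge
    return edges


def largest_component_size(n, edges):
    """Size of the largest connected component, computed with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    sizes = {}
    for v in range(n):
        r = find(v)
        sizes[r] = sizes.get(r, 0) + 1
    return max(sizes.values())


rng = random.Random(0)
n, d = 400, 2
points = [tuple(rng.random() for _ in range(d)) for _ in range(n)]

# Assumed constants: 2 stands in for both c_{d,2} and c_{d,1}.
k = max(1, round(2 * math.log(n)))                   # ~ c_{d,2} log n
k_prime = max(1, round(2 * math.log(math.log(n))))   # ~ c_{d,1} log log n

edges = randomized_near_neighbor_graph(points, k, k_prime, rng)
giant = largest_component_size(n, edges)
print(f"k={k}, k'={k_prime}, edges={len(edges)}, largest component={giant}/{n}")
```

Calling `randomized_near_neighbor_graph(points, k, k, rng)` recovers the full $k$-nearest-neighbor graph, which makes it easy to compare edge counts and largest-component sizes between the sparse and dense variants on the same point set.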


Updated: 2020-07-16
