Randomized near-neighbor graphs, giant components and applications in data science
Journal of Applied Probability (IF 0.7), Pub Date: 2020-07-16, DOI: 10.1017/jpr.2020.21
George C Linderman 1, Gal Mishne 1, Ariel Jaffe 1, Yuval Kluger 2, Stefan Steinerberger 3

If we pick $n$ random points uniformly in $[0,1]^d$ and connect each point to its $c_d \log{n}$ nearest neighbors, where $d\ge 2$ is the dimension and $c_d$ is a constant depending on the dimension, then it is well known that the graph is connected with high probability. We prove that it suffices to connect every point to $c_{d,1} \log{\log{n}}$ points chosen randomly among its $c_{d,2} \log{n}$ nearest neighbors to ensure a giant component of size $n - o(n)$ with high probability. This construction yields a much sparser random graph with $\sim n \log\log{n}$ instead of $\sim n \log{n}$ edges that has comparable connectivity properties. This result has non-trivial implications for problems in data science where an affinity matrix is constructed: instead of connecting each point to its $k$ nearest neighbors, one can often pick $k'\ll k$ random points out of the $k$ nearest neighbors and only connect to those without sacrificing quality of results. This approach can simplify and accelerate computation; we illustrate this with experimental results in spectral clustering of large-scale datasets.
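As a rough illustration of the construction described in the abstract, the following Python sketch samples points uniformly in $[0,1]^d$, connects each point to $k'$ neighbors drawn at random from its $k$ nearest neighbors, and reports the fraction of points in the largest connected component. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the factor 2 used for both $k$ and $k'$ is an arbitrary stand-in for the unspecified constants $c_{d,2}$ and $c_{d,1}$, and the brute-force neighbor search and union-find are choices made here for brevity.

```python
import math
import random


def sparse_knn_graph(points, k, k_prime, rng):
    """Connect each point to k_prime neighbors drawn at random from its k nearest."""
    n = len(points)
    edges = set()
    for i, p in enumerate(points):
        # Brute-force nearest-neighbor search (acceptable for a small demo only).
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in rng.sample(others[:k], k_prime):
            edges.add((min(i, j), max(i, j)))  # store each undirected edge once
    return edges


def largest_component_size(n, edges):
    """Size of the largest connected component, computed with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values())


rng = random.Random(0)
n, d = 400, 2
points = [tuple(rng.random() for _ in range(d)) for _ in range(n)]
# Assumed constants: the factor 2 stands in for c_{d,2} and c_{d,1}.
k = math.ceil(2 * math.log(n))
k_prime = math.ceil(2 * math.log(math.log(n)))
edges = sparse_knn_graph(points, k, k_prime, rng)
print(len(edges), largest_component_size(n, edges) / n)
```

The printed pair is the edge count and the fraction of points in the largest component. Running it for increasing $n$ gives an informal feel for the $\sim n \log\log{n}$ edge count, though a small simulation of this kind does not verify the theorem.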

Updated: 2020-07-16