On Efficient Low Distortion Ultrametric Embedding
arXiv - CS - Computational Geometry. Pub Date: 2020-08-15, DOI: arxiv-2008.06700
Vincent Cohen-Addad, Karthik C. S., and Guillaume Lagarde

A classic problem in unsupervised learning and data analysis is to find simpler and easy-to-visualize representations of the data that preserve its essential properties. A widely-used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric. The most popular algorithms for this task are the classic linkage algorithms (single, average, or complete). However, on a data set of $n$ points in $\Omega(\log n)$ dimensions, these methods exhibit a prohibitive running time of $\Theta(n^2)$. In this paper, we provide a new algorithm which takes as input a set of points $P$ in $\mathbb{R}^d$, and for every $c\ge 1$, runs in time $n^{1+\frac{\rho}{c^2}}$ (for some universal constant $\rho>1$) to output an ultrametric $\Delta$ such that, for any two points $u,v$ in $P$, the value $\Delta(u,v)$ is within a multiplicative factor of $5c$ of the distance between $u$ and $v$ in the "best" ultrametric representation of $P$. Here, the best ultrametric is the ultrametric $\tilde\Delta$ that minimizes the maximum distance distortion with respect to the $\ell_2$ distance, namely that minimizes $\underset{u,v \in P}{\max}\ \frac{\tilde\Delta(u,v)}{\|u-v\|_2}$. We complement the above result by showing that, under popular complexity-theoretic assumptions, for every constant $\varepsilon>0$, no algorithm with running time $n^{2-\varepsilon}$ can distinguish between inputs in the $\ell_\infty$-metric that admit an isometric embedding and those that incur a distortion of $\frac{3}{2}$. Finally, we present an empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
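To make the objective concrete, here is a minimal sketch (not the paper's $n^{1+\rho/c^2}$ algorithm) that builds the classic single-linkage ultrametric, i.e. the $\Theta(n^2)$ baseline the abstract refers to, and measures how far its distances deviate from the $\ell_2$ distances, in the spirit of the distortion objective $\max_{u,v}\ \tilde\Delta(u,v)/\|u-v\|_2$. It uses standard SciPy routines (`pdist`, `linkage`, `cophenet`); the function name and the random test data are hypothetical choices made only for this illustration.

```python
# Illustrative sketch only: single-linkage ultrametric vs. the l2 metric.
# This is the classic linkage baseline mentioned in the abstract, not the
# paper's near-linear-time algorithm.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet


def single_linkage_ultrametric_distortion(points: np.ndarray) -> float:
    """Rough worst-case multiplicative gap between the single-linkage
    cophenetic ultrametric and the Euclidean metric on `points`."""
    d_euclid = pdist(points, metric="euclidean")   # condensed l2 distances
    Z = linkage(d_euclid, method="single")         # classic single-linkage tree
    d_ultra = cophenet(Z)                          # condensed ultrametric distances
    mask = d_euclid > 0                            # ignore coincident points
    ratios = np.maximum(d_ultra[mask] / d_euclid[mask],
                        d_euclid[mask] / d_ultra[mask])
    return float(ratios.max())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.standard_normal((200, 16))             # n points in R^d (toy data)
    print("worst-case distortion:", single_linkage_ultrametric_distortion(P))
```

Note that single linkage produces the subdominant ultrametric (it never overestimates distances), so the two-sided ratio above is only a rough proxy for the dominating-ultrametric objective the paper optimizes.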

Updated: 2020-08-18