Near-Optimal Explainable $k$-Means for All Dimensions
arXiv - CS - Computational Geometry. Pub Date: 2021-06-29, DOI: arxiv-2106.15566
Moses Charikar, Lunjia Hu

Many clustering algorithms are guided by certain cost functions such as the widely-used $k$-means cost. These algorithms divide data points into clusters with often complicated boundaries, creating difficulties in explaining the clustering decision. In a recent work, Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML'20) introduced explainable clustering, where the cluster boundaries are axis-parallel hyperplanes and the clustering is obtained by applying a decision tree to the data. The central question here is: how much does the explainability constraint increase the value of the cost function? Given $d$-dimensional data points, we show an efficient algorithm that finds an explainable clustering whose $k$-means cost is at most $k^{1 - 2/d}\mathrm{poly}(d\log k)$ times the minimum cost achievable by a clustering without the explainability constraint, assuming $k,d\ge 2$. Combining this with an independent work by Makarychev and Shan (ICML'21), we get an improved bound of $k^{1 - 2/d}\mathrm{polylog}(k)$, which we show is optimal for every choice of $k,d\ge 2$ up to a poly-logarithmic factor in $k$. For $d = 2$ in particular, we show an $O(\log k\log\log k)$ bound, improving exponentially over the previous best bound of $\widetilde O(k)$.
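To make the central question concrete, the following is a minimal sketch of an explainable clustering: a decision tree whose internal nodes are axis-parallel threshold cuts and whose leaves are clusters, together with the $k$-means cost and the ratio between the tree's cost and that of an unconstrained reference clustering. It is not the algorithm from the paper. The names `Leaf`, `Node`, `assign`, and `kmeans_cost`, the toy data, and the hand-picked reference labels are all assumptions made for illustration, and the reference clustering is not claimed to be optimal.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union

Point = Sequence[float]


@dataclass
class Leaf:
    cluster: int  # cluster label assigned to every point reaching this leaf


@dataclass
class Node:
    axis: int          # coordinate tested by this axis-parallel cut
    threshold: float   # points with x[axis] <= threshold go left
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def assign(tree: Tree, x: Point) -> int:
    """Follow the threshold cuts from the root down to a leaf."""
    while isinstance(tree, Node):
        tree = tree.left if x[tree.axis] <= tree.threshold else tree.right
    return tree.cluster


def kmeans_cost(points: List[Point], labels: List[int], k: int) -> float:
    """Sum of squared distances from each point to its cluster mean."""
    cost = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue
        dim = len(members[0])
        mean = [sum(p[i] for p in members) / len(members) for i in range(dim)]
        cost += sum((p[i] - mean[i]) ** 2 for p in members for i in range(dim))
    return cost


# Toy 2-dimensional data (hypothetical), with k = 3 clusters.
points = [(0, 0), (0, 2), (5, 0), (5, 2), (2, 6), (3, 6), (0, 3.2)]

# A threshold tree with k = 3 leaves: first cut on y, then on x.
tree = Node(axis=1, threshold=3.0,
            left=Node(axis=0, threshold=2.0, left=Leaf(0), right=Leaf(1)),
            right=Leaf(2))

tree_labels = [assign(tree, p) for p in points]

# Hand-picked unconstrained clustering used as a reference (an assumption,
# not necessarily the optimum): the last point joins cluster 0 instead of 2.
reference_labels = [0, 0, 1, 1, 2, 2, 0]

tree_cost = kmeans_cost(points, tree_labels, k=3)
reference_cost = kmeans_cost(points, reference_labels, k=3)

# Ratio of explainable cost to reference cost on this toy instance.
ratio = tree_cost / reference_cost
print(tree_labels, tree_cost, reference_cost, ratio)
```

On this toy instance the cut at $y = 3$ sends the point $(0, 3.2)$ to a far-away cluster, so the tree's cost exceeds the reference cost. The bounds in the abstract concern how large this kind of ratio can be in the worst case when the reference is the optimal unconstrained clustering.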

Updated: 2021-06-30