Near-optimal large-scale k-medoids clustering, Information Sciences

Near-optimal large-scale k-medoids clustering
Information Sciences, Pub Date: 2020-09-07, DOI: 10.1016/j.ins.2020.08.121
Anton V. Ushakov, Igor Vasilyev

The k-medoids (k-median) problem is one of the best-known unsupervised clustering problems. Due to its complexity, finding high-quality solutions for huge-scale datasets remains extremely challenging. Many approaches that find optimal or high-quality solutions are applicable only to small and medium-size instances. On the other hand, parallel, distributed algorithms that can handle huge-scale datasets usually provide very poor solutions. In this paper, we develop the first parallel, distributed primal–dual heuristic algorithm for the k-medoids problem. Its main component is a very efficient parallel subgradient column generation procedure that solves a Lagrangian dual problem and finds a tight bound on solution quality. High-quality solutions are then produced by a parallel core selection technique. We considerably reduce the computational burden and memory load by employing a nearest neighbor strategy to approximate the dissimilarity matrix. We demonstrate that our algorithm finds solutions very close to optimal, as confirmed by the tightness of the dual bounds, for instances that are much larger than those considered in the literature to date. Our experiments include clustering large-scale collections of face images into several thousand clusters. We show that our approach outperforms improved parallel versions of the most popular k-medoids clustering algorithms, achieving nearly linear parallel speedup.
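To make the role of the Lagrangian dual bound concrete, the following is a minimal sequential sketch, not the authors' algorithm. It assumes the textbook Lagrangian relaxation of the k-median assignment constraints, a dense dissimilarity matrix, and a simple nearest-medoid primal heuristic; the paper's column generation, core selection, nearest neighbor approximation, and parallelism are not reproduced. The function name `lagrangian_k_medoids` and its parameters are hypothetical.

```python
import math
import random


def lagrangian_k_medoids(d, k, iters=200, theta=1.0):
    """Subgradient sketch for the Lagrangian dual of k-medoids.

    d is a dense n x n dissimilarity matrix (list of lists).
    Returns (best_lower_bound, best_upper_bound, best_medoids).
    """
    n = len(d)
    # One multiplier per point; start at the distance to the nearest other point.
    lam = [min(d[i][j] for j in range(n) if j != i) for i in range(n)]
    best_lb, best_ub, best_medoids = float("-inf"), float("inf"), None

    for _ in range(iters):
        # Reduced cost of choosing j as a medoid under the current multipliers.
        rho = [sum(min(0.0, d[i][j] - lam[i]) for i in range(n)) for j in range(n)]
        opened = sorted(range(n), key=rho.__getitem__)[:k]

        # Value of the relaxed problem: a valid lower bound on the optimum.
        lb = sum(lam) + sum(rho[j] for j in opened)
        best_lb = max(best_lb, lb)

        # Primal heuristic: assign every point to its nearest chosen medoid.
        ub = sum(min(d[i][j] for j in opened) for i in range(n))
        if ub < best_ub:
            best_ub, best_medoids = ub, sorted(opened)

        # Subgradient: 1 minus the number of chosen medoids point i is attached to.
        g = [1 - sum(1 for j in opened if d[i][j] - lam[i] < 0) for i in range(n)]
        norm2 = sum(v * v for v in g)
        if norm2 == 0 or best_ub - lb <= 1e-9:
            break  # relaxed solution is feasible, or the gap is closed
        step = theta * (best_ub - lb) / norm2
        lam = [lam[i] + step * g[i] for i in range(n)]

    return best_lb, best_ub, best_medoids


# Toy usage on synthetic 2-D points (illustrative data, not from the paper).
random.seed(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(random.gauss(cx, 1.0), random.gauss(cy, 1.0))
          for cx, cy in centers for _ in range(10)]
d = [[math.hypot(px - qx, py - qy) for qx, qy in points] for px, py in points]
lb, ub, medoids = lagrangian_k_medoids(d, k=3)
print(f"lower bound {lb:.3f}, upper bound {ub:.3f}, medoids {medoids}")
```

The gap between the two returned bounds is what certifies solution quality: the optimal cost always lies between the lower and the upper bound, so a small gap means the heuristic solution is provably close to optimal.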

Updated: 2020-09-07
