An empirical comparison between stochastic and deterministic centroid initialisation for K-means variations, Machine Learning

Machine Learning (IF 4.3), Pub Date: 2021-07-12, DOI: 10.1007/s10994-021-06021-7
Avgoustinos Vouros, Eleni Vasilaki, Stephen Langdell, Mike Croucher

K-Means is one of the most widely used algorithms for data clustering and the usual clustering method for benchmarking. Despite its wide application, it is well known to suffer from a series of disadvantages: it is only able to find local minima, and the positions of the initial cluster centres (centroids) can greatly affect the clustering solution. Over the years, many K-Means variations and initialisation techniques have been proposed, with different degrees of complexity. In this study, we focus on common K-Means variations along with a range of deterministic and stochastic initialisation techniques. We show that, on average, more sophisticated initialisation techniques alleviate the need for complex clustering methods. Furthermore, deterministic methods perform better than stochastic methods. However, there is a trade-off: less sophisticated stochastic methods, executed multiple times, can result in better clustering. Factoring in execution time, deterministic methods can be competitive and result in a good clustering solution. These conclusions are obtained through extensive benchmarking using a range of synthetic model generators and real-world data sets.
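To make the distinction in the abstract concrete, the following is a minimal sketch of plain Lloyd-style K-Means started from two kinds of initial centroids: a stochastic one (k data points drawn at random, optionally repeated over several restarts) and a deterministic one. The deterministic scheme used here is a simple farthest-first (maximin) rule seeded at the point nearest the data mean; it is chosen only for illustration and is not necessarily one of the methods benchmarked in the paper. All function names (`lloyd`, `random_init`, `maximin_init`, `best_of_restarts`) are hypothetical.

```python
import numpy as np


def lloyd(X, centroids, n_iter=100):
    """Plain Lloyd iterations from the given initial centroids.

    Returns (centroids, labels, sse), where sse is the sum of squared
    distances from each point to its assigned centroid.
    """
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    sse = d[np.arange(len(X)), labels].sum()
    return centroids, labels, sse


def random_init(X, k, rng):
    """Stochastic initialisation: k distinct data points chosen at random."""
    return X[rng.choice(len(X), size=k, replace=False)]


def maximin_init(X, k):
    """Deterministic initialisation (illustrative assumption, not the paper's).

    Starts at the data point nearest the overall mean, then repeatedly adds
    the point farthest from all centroids chosen so far.
    """
    first = ((X - X.mean(axis=0)) ** 2).sum(axis=1).argmin()
    chosen = [first]
    dist = ((X - X[first]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        nxt = dist.argmax()
        chosen.append(nxt)
        dist = np.minimum(dist, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[chosen]


def best_of_restarts(X, k, n_restarts, seed=0):
    """Run random-initialised K-Means n_restarts times; keep the lowest SSE."""
    rng = np.random.default_rng(seed)
    runs = [lloyd(X, random_init(X, k, rng)) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])
```

A deterministic initialisation yields the same solution on every call at the cost of a single run, whereas the stochastic variant trades extra runs for the chance of a lower objective. Comparing `lloyd(X, maximin_init(X, k))[2]` against `best_of_restarts(X, k, n_restarts)[2]` on a given data set is one way to observe that trade-off.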

Updated: 2021-07-13
