The next-generation K-means algorithm.
Statistical Analysis and Data Mining (IF 2.1), Pub Date: 2018-05-11, DOI: 10.1002/sam.11379
Eugene Demidenko

Typically, when referring to a model‐based classification, the mixture distribution approach is understood. In contrast, we revive the hard‐classification model‐based approach developed by Banfield and Raftery (1993) for which K‐means is equivalent to the maximum likelihood (ML) estimation. The next‐generation K‐means algorithm does not end after the classification is achieved, but moves forward to answer the following fundamental questions: Are there clusters, how many clusters are there, what are the statistical properties of the estimated means and index sets, what is the distribution of the coefficients in the clusterwise regression, and how to classify multilevel data? The statistical model‐based approach for the K‐means algorithm is the key, because it allows statistical simulations and studying the properties of classification following the track of the classical statistics. This paper illustrates the application of the ML classification to testing the no‐clusters hypothesis, to studying various methods for selection of the number of clusters using simulations, robust clustering using Laplace distribution, studying properties of the coefficients in clusterwise regression, and finally to multilevel data by marrying the variance components model with K‐means.
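
The equivalence between K-means and ML estimation mentioned in the abstract can be illustrated with a minimal sketch. It assumes spherical Gaussian clusters with a common variance, in which case the classification log-likelihood is a decreasing function of the within-cluster sum of squares. The function names `kmeans_ml` and `classification_loglik` and the simulated data are illustrative assumptions, not code from the paper.

```python
import numpy as np

def kmeans_ml(X, k, n_iter=100, seed=0):
    """Plain Lloyd's K-means (illustrative); returns labels, means, within-cluster SS."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Hard-classification step: assign each point to its nearest mean.
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Estimation step: each mean becomes the centroid of its cluster
        # (an empty cluster keeps its previous mean).
        new_means = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        if np.allclose(new_means, means):
            break
        means = new_means
    # Final assignment and within-cluster sum of squares for the returned means.
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wss = d2[np.arange(len(X)), labels].sum()
    return labels, means, wss

def classification_loglik(wss, n, p):
    """Gaussian classification log-likelihood with common variance sigma^2,
    evaluated at the ML estimate sigma^2 = wss / (n * p)."""
    sigma2 = wss / (n * p)
    return -0.5 * n * p * (np.log(2 * np.pi * sigma2) + 1.0)

# Simulated example (assumed data): two well-separated clusters in the plane.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
for k in (1, 2):
    _, _, wss = kmeans_ml(X, k)
    print(k, wss, classification_loglik(wss, *X.shape))
```

Under this assumed model a smaller within-cluster sum of squares always means a larger log-likelihood, so minimizing the K-means criterion and maximizing the likelihood select the same hard classification. The log-likelihood also grows with k by construction, which is why choosing the number of clusters needs the separate methods the paper studies.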

Updated: 2018-05-11