A hybrid reciprocal model of PCA and K-means with an innovative approach of considering sub-datasets for the improvement of K-means initialization and step-by-step labeling to create clusters with high interpretability
Pattern Analysis and Applications ( IF 3.9 ) Pub Date : 2021-05-12 , DOI: 10.1007/s10044-021-00977-x
Seyed Alireza Mousavian Anaraki , Abdorrahman Haeri , Fateme Moslehi

The K-means algorithm is a popular clustering method that is sensitive to initialization and to the chosen number of clusters, and its performance degrades considerably on high-dimensional datasets. Principal component analysis (PCA) is a linear dimensionality reduction method that is closely related to the K-means algorithm. Dimension reduction allows the initial centers to be selected in a smaller space, which offers one solution to the initialization problem. The present study investigates the reciprocal relationship between K-means and PCA and adopts an innovative approach of creating sub-datasets and applying step-by-step labeling in the hybrid execution of both algorithms to propose two methods, namely K-P and P-K. The clusters obtained from the two proposed methods are highly interpretable, as verified by the step-by-step labeling results on a human resource dataset. Interpretability was evaluated via the distribution of features of interest (FoI), suggesting improved results for both datasets. In addition to the improved qualitative results, the study showed improvements in the sum of squared errors divided by the total number of data points (SSE/N) and in the silhouette score on 10 datasets, compared with eight initialization methods from previous studies. P-K outperformed K-P in both results and run time.
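The general idea of choosing initial centers in a reduced space and then scoring the result by SSE/N can be sketched as follows. This is a minimal illustration and not the authors' K-P or P-K method: the seeding heuristic (sort points along the first principal axis, cut them into k equal chunks, use each chunk's mean as a center) is an assumption, as are all function names and the synthetic data. It uses only the standard library so that it runs anywhere; in practice NumPy or scikit-learn would be the usual choice.

```python
import math
import random


def first_principal_axis(rows, iters=200):
    """Return (mean, unit vector) for the leading principal axis via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # d x d covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)] for a in range(d)]
    # Uniform start vector; assumed not orthogonal to the leading axis.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v


def pca_initial_centers(rows, k):
    """Hypothetical seeding heuristic: split points into k chunks by 1-D PCA score."""
    mean, axis = first_principal_axis(rows)
    d = len(axis)
    ordered = sorted(rows, key=lambda r: sum((r[j] - mean[j]) * axis[j] for j in range(d)))
    n = len(ordered)
    centers = []
    for i in range(k):
        chunk = ordered[i * n // k:(i + 1) * n // k]  # non-empty when n >= k
        centers.append([sum(r[j] for r in chunk) / len(chunk) for j in range(d)])
    return centers


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, init_centers, max_iter=100):
    """Plain Lloyd iterations; returns (centers, labels, SSE/N)."""
    centers = [list(c) for c in init_centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda c: sq_dist(r, centers[c]))
                      for r in rows]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(r, centers[lab]) for r, lab in zip(rows, labels))
    return centers, labels, sse / len(rows)


# Synthetic data (an assumption for the demo): three Gaussian blobs in 3-D.
random.seed(0)
blobs = [(0.0, 0.0, 0.0), (5.0, 5.0, 0.0), (10.0, 0.0, 5.0)]
data = [[random.gauss(m, 0.5) for m in mu] for mu in blobs for _ in range(50)]

centers, labels, sse_per_n = kmeans(data, pca_initial_centers(data, 3))
print("SSE/N:", round(sse_per_n, 3))
```

Because the seeding is deterministic, repeated runs on the same data give the same clustering, which is the practical appeal of PCA-based initialization over random restarts. The silhouette score mentioned in the abstract is not computed here.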




Updated: 2021-05-12