Improving clustering performance using independent component analysis and unsupervised feature learning
Human-centric Computing and Information Sciences (IF 3.9) Pub Date: 2018-08-23, DOI: 10.1186/s13673-018-0148-3
Eren Gultepe , Masoud Makrehchi

Objective

To provide a parsimonious clustering pipeline that achieves performance comparable to deep learning-based clustering methods, but without using deep learning algorithms such as autoencoders.

Materials and methods

Clustering was performed on six benchmark datasets: five image datasets used in object, face, and digit recognition tasks (COIL20, COIL100, CMU-PIE, USPS, and MNIST) and one text document dataset (REUTERS-10K) used in topic recognition. K-means, spectral clustering, Graph Regularized Non-negative Matrix Factorization, and K-means with principal component analysis were used for clustering. For each clustering algorithm, blind source separation (BSS) using Independent Component Analysis (ICA) was applied. Unsupervised feature learning (UFL) using reconstruction cost ICA (RICA) and sparse filtering (SFT) was also performed for feature extraction prior to the clustering algorithms. Clustering performance was assessed using the normalized mutual information and unsupervised clustering accuracy metrics.
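The two evaluation metrics named above are standard and can be computed directly; normalized mutual information is available in scikit-learn, while unsupervised clustering accuracy is conventionally defined as the best one-to-one matching between predicted cluster labels and ground-truth classes, found with the Hungarian algorithm. A minimal sketch (not the authors' code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Unsupervised clustering accuracy: fraction of samples correctly
    assigned under the best one-to-one mapping between cluster labels
    and ground-truth classes (Hungarian algorithm)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = max(y_pred.max(), y_true.max()) + 1
    # Contingency matrix: w[i, j] = size of overlap of cluster i and class j.
    w = np.zeros((n, n), dtype=np.int64)
    for p, t in zip(y_pred, y_true):
        w[p, t] += 1
    # Maximizing total overlap == minimizing its negation.
    row, col = linear_sum_assignment(-w)
    return w[row, col].sum() / y_pred.size

# Clusters that are a pure relabeling of the classes score perfectly.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]
print(clustering_accuracy(y_true, y_pred))           # 1.0
print(normalized_mutual_info_score(y_true, y_pred))  # 1.0
```

Both metrics are invariant to permutations of the cluster labels, which is why they are the standard choices for unsupervised evaluation.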

Results

Performing ICA BSS after the initial matrix factorization step provided the maximum clustering performance on four out of six datasets (COIL100, CMU-PIE, MNIST, and REUTERS-10K). Applying UFL as an initial processing component provided the maximum performance on three out of six datasets (USPS, COIL20, and COIL100). Compared to state-of-the-art non-deep-learning clustering methods, ICA BSS and/or UFL with graph-based clustering algorithms outperformed all other methods. With respect to deep learning-based clustering algorithms, the new methodology presented here obtained the following rankings: COIL20, 2nd out of 5; COIL100, 2nd out of 5; CMU-PIE, 2nd out of 5; USPS, 3rd out of 9; MNIST, 8th out of 15; and REUTERS-10K, 4th out of 5.

Discussion

Using only ICA BSS and UFL with RICA and SFT, we achieved clustering accuracy better than, or on par with, many deep learning-based clustering algorithms. For instance, by applying ICA BSS to spectral clustering on the MNIST dataset, we obtained an accuracy of 0.882. This is better than the well-known Deep Embedded Clustering algorithm, which obtained an accuracy of 0.818 using stacked denoising autoencoders in its model.
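The ICA-plus-spectral-clustering combination highlighted above can be sketched with scikit-learn. This is an illustrative run on the small built-in digits dataset, not a reproduction of the paper's MNIST setup (where the reported accuracy was 0.882); the affinity, neighbor count, and component count here are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import FastICA
from sklearn.cluster import SpectralClustering
from sklearn.metrics import normalized_mutual_info_score

X, y = load_digits(return_X_y=True)

# ICA BSS as preprocessing before a graph-based clustering algorithm.
S = FastICA(n_components=20, random_state=0, max_iter=1000).fit_transform(X)

# Spectral clustering on a k-nearest-neighbor similarity graph
# built from the independent components.
sc = SpectralClustering(n_clusters=10, affinity="nearest_neighbors",
                        n_neighbors=10, assign_labels="kmeans",
                        random_state=0)
labels = sc.fit_predict(S)
print(round(normalized_mutual_info_score(y, labels), 3))
```

Spectral clustering operates on the similarity graph rather than on raw Euclidean distances, so improving the representation (here via ICA) directly changes the graph and hence the partition.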

Conclusion

Using the new clustering pipeline presented here, effective clustering performance can be obtained without employing deep clustering algorithms and their accompanying hyper-parameter tuning procedures.

