Statistical Analysis of Microarray Data Clustering using NMF, Spectral Clustering, Kmeans, and GMM
IEEE/ACM Transactions on Computational Biology and Bioinformatics (IF 3.6), Pub Date: 2020-09-21, DOI: 10.1109/tcbb.2020.3025486
Andri Mirzal

In the unsupervised learning literature, clustering of microarray gene expression datasets has been studied extensively, with nonnegative matrix factorization (NMF), spectral clustering, kmeans, and Gaussian mixture models (GMM) among the most widely used methods. However, only a limited number of works use statistical analysis to measure the significance of performance differences between these methods. This paper presents a statistical analysis of the performance differences between ten NMF, six spectral clustering, four GMM, and the standard kmeans algorithms in clustering eleven publicly available microarray gene expression datasets, with the number of clusters ranging from two to ten. The experimental results show that, statistically, NMFs and kmeans perform similarly and outperform spectral clustering. As spectral clustering can be used to uncover hidden manifold structures, the underperformance of the spectral methods leads us to question whether the datasets have manifold structures. Visual inspection using multidimensional scaling plots indicates that such structures do not exist. Moreover, as the plots indicate that clusters in some datasets have elliptical boundaries, GMM methods are also evaluated. The experimental results show that GMM methods outperform the other methods to some degree, which implies that the datasets follow Gaussian distributions.
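
The kind of comparison the abstract describes can be sketched roughly as follows. This is an illustrative sketch only, not the paper's protocol: the abstract names neither the evaluation metric nor the statistical test, so the adjusted Rand index and the Wilcoxon signed-rank test are assumptions here, the synthetic blobs stand in for the microarray datasets, and one scikit-learn variant per family stands in for the paper's ten NMF, six spectral, and four GMM algorithms. The helper names (`make_dataset`, `cluster_labels`, `score_table`, `paired_p_value`) are hypothetical.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.decomposition import NMF
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

METHODS = ["nmf", "spectral", "kmeans", "gmm"]


def make_dataset(seed, k):
    """Synthetic stand-in for a samples-by-genes matrix (assumption, not real data)."""
    X, y = make_blobs(n_samples=120, n_features=10, centers=k,
                      cluster_std=7.0, random_state=seed)
    return X - X.min(), y  # shift so every entry is nonnegative, as NMF requires


def cluster_labels(name, X, k, seed=0):
    """Run one representative algorithm per family and return hard cluster labels."""
    if name == "kmeans":
        return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    if name == "spectral":
        model = SpectralClustering(n_clusters=k, affinity="nearest_neighbors",
                                   n_neighbors=10, random_state=seed)
        return model.fit_predict(X)
    if name == "gmm":
        model = GaussianMixture(n_components=k, covariance_type="diag",
                                random_state=seed)
        return model.fit_predict(X)
    if name == "nmf":
        # Assign each sample to the factor with the largest coefficient.
        W = NMF(n_components=k, init="nndsvda", max_iter=500,
                random_state=seed).fit_transform(X)
        return W.argmax(axis=1)
    raise ValueError(f"unknown method: {name}")


def score_table(n_datasets=8):
    """Rows are datasets, columns follow METHODS; entries are adjusted Rand indices."""
    scores = np.zeros((n_datasets, len(METHODS)))
    for i in range(n_datasets):
        k = 2 + i % 4  # vary the number of clusters across datasets
        X, y = make_dataset(seed=i, k=k)
        for j, name in enumerate(METHODS):
            scores[i, j] = adjusted_rand_score(y, cluster_labels(name, X, k))
    return scores


def paired_p_value(a, b):
    """Wilcoxon signed-rank p-value for paired per-dataset scores of two methods."""
    if np.allclose(a, b):
        return 1.0  # identical scores: the test is undefined, report no difference
    p = float(wilcoxon(a, b).pvalue)
    return 1.0 if np.isnan(p) else p


if __name__ == "__main__":
    scores = score_table()
    for i in range(len(METHODS)):
        for j in range(i + 1, len(METHODS)):
            p = paired_p_value(scores[:, i], scores[:, j])
            print(f"{METHODS[i]} vs {METHODS[j]}: p = {p:.3f}")
```

A paired test is used because every method is scored on the same datasets, so the per-dataset differences are the quantity of interest. With many pairwise comparisons, a multiple-comparison correction would normally be applied as well; it is left out here for brevity.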

Updated: 2020-09-21