Simultaneous estimation of cluster number and feature sparsity in high-dimensional cluster analysis



Biometrics (IF 1.4), Pub Date: 2021-02-23, DOI: 10.1111/biom.13449
Yujia Li, Xiangrui Zeng, Chien-Wei Lin, George C Tseng


Estimating the number of clusters (K) is a critical and often difficult task in cluster analysis. Many methods have been proposed to estimate K, including some top performers that use a resampling approach. When cluster analysis is performed on high-dimensional data, simultaneous clustering and feature selection is needed to improve interpretation and performance. To our knowledge, little work has addressed simultaneous estimation of K and the feature sparsity parameter in high-dimensional exploratory cluster analysis. In this paper, we propose a resampling method to bridge this gap and evaluate its performance under the sparse K-means clustering framework. The proposed target function balances the sensitivity and specificity of pairwise subject co-clustering, comparing the clustering of the full data with that of subsampled data. In extensive simulations, the method performs among the best when compared with classical methods for estimating K in low-dimensional data. For high-dimensional simulated data, it also shows superior performance in simultaneously estimating K and the feature sparsity parameter. Finally, we evaluated the methods on four microarray, two RNA-seq, one SNP, and two non-omics datasets. The proposed method achieves better clustering accuracy with fewer selected predictive genes in almost all real applications.
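The abstract does not give the exact form of the target function, so the following is only a minimal sketch of the general idea: comparing a full-data clustering with a subsample clustering pair by pair. It treats the full-data clustering as the reference and counts, over all pairs of shared subjects, how often "clustered together" and "clustered apart" are reproduced in the subsample. The function names `pairwise_agreement` and `balanced_score`, and the choice of sensitivity + specificity - 1 as the balance, are illustrative assumptions and not the authors' definition.

```python
from itertools import combinations


def pairwise_agreement(full_labels, sub_labels):
    """Pairwise sensitivity and specificity between two clusterings.

    full_labels, sub_labels: dicts mapping subject id -> cluster label.
    Only subjects present in both dicts are compared, since a subsample
    contains a subset of the subjects. The full-data clustering is
    treated as the reference (an assumption made for this sketch).
    """
    shared = sorted(set(full_labels) & set(sub_labels))
    tp = fn = tn = fp = 0
    for a, b in combinations(shared, 2):
        together_full = full_labels[a] == full_labels[b]
        together_sub = sub_labels[a] == sub_labels[b]
        if together_full and together_sub:
            tp += 1  # pair co-clustered in both clusterings
        elif together_full:
            fn += 1  # co-clustered in full data, split in subsample
        elif together_sub:
            fp += 1  # split in full data, co-clustered in subsample
        else:
            tn += 1  # pair separated in both clusterings
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def balanced_score(sensitivity, specificity):
    """Hypothetical balance of the two rates (a Youden-style index)."""
    return sensitivity + specificity - 1.0


# Toy example: subject 5 was not drawn into the subsample.
full = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
sub = {1: "a", 2: "b", 3: "b", 4: "b"}
sens, spec = pairwise_agreement(full, sub)
print(sens, spec, balanced_score(sens, spec))
```

In a resampling scheme of the kind the abstract describes, a score like this would presumably be averaged over many subsamples for each candidate K and sparsity parameter, and the combination with the best average retained. That selection loop is omitted here because its details are not stated in the abstract.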


Updated: 2021-02-23
