Nonparametric cluster significance testing with reference to a unimodal null distribution
Biometrics (IF 1.9), Pub Date: 2020-09-23, DOI: 10.1111/biom.13376
Erika S Helgeson 1, David M Vock 1, Eric Bair 2

Cluster analysis is an unsupervised learning strategy that is exceptionally useful for identifying homogeneous subgroups of observations in data sets of unknown structure. However, it is challenging to determine if the identified clusters represent truly distinct subgroups rather than noise. Existing approaches for addressing this problem tend to define clusters based on distributional assumptions, ignore the inherent correlation structure in the data, or are not suited for high-dimension low-sample size (HDLSS) settings. In this paper, we propose a novel method to evaluate the significance of identified clusters by comparing the explained variation due to the clustering from the original data to that produced by clustering a unimodal reference distribution that preserves the covariance structure in the data. The reference distribution is generated using kernel density estimation, and thus, does not require that the data follow a particular distribution. By utilizing sparse covariance estimation, the method is adapted for the HDLSS setting. The approach can be used to test the null hypothesis that the data cannot be partitioned into clusters and to determine the optimal number of clusters. Simulation examples, theoretical evaluations, and applications to temporomandibular disorder research and cancer microarray data illustrate the utility of the proposed method.
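The comparison described in the abstract can be illustrated with a rough Monte Carlo sketch. This is not the authors' implementation: it assumes 2-means clustering, uses the ratio of within-cluster to total sum of squares as the measure of explained variation, and substitutes a Gaussian with the sample mean and covariance for the paper's kernel-density-based unimodal reference. The function names `two_means_index` and `cluster_pvalue` and all parameters are hypothetical.

```python
import numpy as np


def two_means_index(x, rng, n_init=5, n_iter=50):
    """Within-cluster SS / total SS for the best of several 2-means runs.

    Smaller values mean the two-cluster split explains more of the variation.
    """
    total = ((x - x.mean(axis=0)) ** 2).sum()
    best = np.inf
    for _ in range(n_init):
        # Start from two distinct observations chosen at random.
        centers = x[rng.choice(len(x), size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if len(np.unique(labels)) < 2:
                break  # degenerate split; keep the current centers
            new = np.array([x[labels == k].mean(axis=0) for k in range(2)])
            if np.allclose(new, centers):
                break
            centers = new
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, dist.min(axis=1).sum())
    return best / total


def cluster_pvalue(x, n_ref=200, seed=0):
    """Monte Carlo p-value for 'no cluster structure' (illustrative only).

    ASSUMPTION: the reference data are drawn from a Gaussian with the sample
    mean and covariance. The paper instead builds its unimodal reference with
    kernel density estimation; the Gaussian is a simplified stand-in that
    still preserves the covariance structure. Expects x with shape (n, d),
    d >= 2.
    """
    rng = np.random.default_rng(seed)
    observed = two_means_index(x, rng)
    mean = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    ref = np.array([
        two_means_index(rng.multivariate_normal(mean, cov, size=len(x)), rng)
        for _ in range(n_ref)
    ])
    # Fraction of reference data sets that cluster at least as tightly.
    p_value = (1 + np.sum(ref <= observed)) / (n_ref + 1)
    return observed, p_value
```

Calling `cluster_pvalue(data)` on an `(n, d)` array returns the observed index and a p-value; a small p-value suggests the data cluster more tightly than unimodal data with the same covariance would. The sparse covariance estimation that the abstract mentions for the HDLSS setting is not covered by this sketch.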

Updated: 2020-09-23