Out-of-bag stability estimation for k-means clustering
Statistical Analysis and Data Mining (IF 2.1), Pub Date: 2022-08-03, DOI: 10.1002/sam.11593
Tianmou Liu, Han Yu, Rachael Hageman Blair

Clustering data is a challenging problem in unsupervised learning, where there is no gold standard. Results depend on several factors, such as the choice of clustering method, the dissimilarity measure, parameter settings, and the determination of the number of reliable groupings. Stability has become a valuable surrogate for performance and robustness: it gives an investigator insight into the quality of a clustering and guidance on subsequent cluster prioritization. This work develops a framework for stability measurement based on resampling and out-of-bag (OB) estimation. Bootstrapping methods for cluster stability can be prone to overfitting, in a setting analogous to poor delineation of test and training sets in supervised learning. Stability that relies on the OB items from a resampling overcomes these issues and does not depend on a reference clustering for comparisons. Furthermore, OB stability can be estimated at the level of the item, the cluster, and as an overall summary, which has good interpretive value. The framework is extended to determine the number of clusters (model selection) through contrasts between stability estimates on the clustered data and stability estimates on clustered reference data with no signal. These contrasts form stability profiles that can be used to identify the largest differences in stability, and they do not require a direct threshold on stability values, which tend to be data specific. These approaches can be implemented using the R package bootcluster, which is available on the Comprehensive R Archive Network (CRAN).
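
The out-of-bag idea in the abstract can be illustrated with a minimal sketch. It is not the paper's estimator and not the bootcluster API. It assumes one plausible reading: fit k-means on each bootstrap resample, assign the items left out of that resample (the OB items) to their nearest centroid, and score each pair of items by how consistently they are placed together or apart across the resamples in which both are OB. The function names (`ob_stability`, `kmeans`, `nearest`, `sq_dist`), the pairwise score `max(p, 1 - p)`, and the toy data are all hypothetical choices made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def kmeans(points, k, rng, iters=25):
    """Plain Lloyd's algorithm; returns a list of k centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Each centroid becomes its group's mean; an empty group keeps its old centroid.
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def ob_stability(points, k, n_boot=50, seed=0):
    """Illustrative OB stability (an assumption, not the published estimator).

    Returns (item_scores, overall). Each item score lies in [0.5, 1.0].
    """
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]  # both OB and given the same label
    both_ob = [[0] * n for _ in range(n)]   # both OB in the same resample
    for _ in range(n_boot):
        in_bag = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        bag = set(in_bag)
        ob = [i for i in range(n) if i not in bag]
        centroids = kmeans([points[i] for i in in_bag], k, rng)
        label = {i: nearest(points[i], centroids) for i in ob}
        for a in ob:
            for b in ob:
                if a != b:
                    both_ob[a][b] += 1
                    together[a][b] += label[a] == label[b]
    item_scores = []
    for i in range(n):
        pair_scores = []
        for j in range(n):
            if both_ob[i][j]:
                p = together[i][j] / both_ob[i][j]
                # A pair is consistent if it is almost always together or almost always apart.
                pair_scores.append(max(p, 1 - p))
        item_scores.append(sum(pair_scores) / len(pair_scores) if pair_scores else float("nan"))
    valid = [s for s in item_scores if s == s]  # drop NaN (items never paired while OB)
    overall = sum(valid) / len(valid) if valid else float("nan")
    return item_scores, overall


# Toy data (hypothetical): two well-separated Gaussian blobs in the plane.
demo_rng = random.Random(1)
data = [(demo_rng.gauss(0, 0.3), demo_rng.gauss(0, 0.3)) for _ in range(20)]
data += [(demo_rng.gauss(5, 0.3), demo_rng.gauss(5, 0.3)) for _ in range(20)]
item_scores, overall = ob_stability(data, k=2, n_boot=30)
print(f"overall OB stability: {overall:.3f}")
```

Because the scores come only from items held out of each fit, no reference clustering of the full data is needed. Averaging `item_scores` within a cluster would give a cluster-level summary in the same spirit, and repeating the calculation on signal-free reference data for several values of `k` would give the kind of contrast the abstract describes for choosing the number of clusters.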

Updated: 2022-08-03