kluster: An Efficient Scalable Procedure for Approximating the Number of Clusters in Unsupervised Learning


Big Data Research (IF 3.5), Pub Date: 2018-06-01, DOI: 10.1016/j.bdr.2018.05.003
Hossein Estiri, Behzad Abounia Omran, Shawn N. Murphy

The majority of the clinical observation data stored in large-scale Electronic Health Record (EHR) research data networks are unlabeled. Unsupervised clustering can provide invaluable tools for studying patient sub-groups in these data. Many popular unsupervised clustering algorithms depend on identifying the number of clusters in advance. Multiple statistical methods are available to approximate the number of clusters in a dataset. However, the available methods are computationally inefficient when applied to large amounts of data, and scalable analytical procedures are needed to extract knowledge from large clinical datasets. Using simulated, clinical, and public data, we developed and tested the kluster procedure for approximating the number of clusters in a large clinical dataset. The kluster procedure iteratively applies four statistical cluster number approximation methods to small subsets of the data drawn randomly with replacement, and recommends the most frequent and the mean number of clusters resulting from the iterations as the potential optimum number of clusters. Our results showed that kluster's most-frequent output, obtained by iteratively applying a model-based clustering strategy using the Bayesian Information Criterion (BIC) to samples of 200–500 data points over 100 iterations, offers a reliable and scalable solution for approximating the number of clusters in unsupervised clustering. We provide the kluster procedure as an R package.
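The resample-and-aggregate idea described in the abstract can be illustrated with a minimal sketch. It is written in Python with only the standard library and is not the kluster R package. The function names (`approximate_k`, `count_groups_by_gaps`), their parameters, and the synthetic data are hypothetical. In particular, the estimator used here is a toy gap-counting rule for 1-D data, standing in for the model-based BIC method named in the abstract. Only the sampling with replacement, the repeated estimation, and the most-frequent and mean aggregation follow the description above.

```python
import random
from collections import Counter
from statistics import mean


def count_groups_by_gaps(sample, min_gap=5.0):
    """Toy stand-in estimator (an assumption, NOT the BIC-based method):
    counts groups of 1-D points separated by gaps wider than min_gap."""
    ordered = sorted(sample)
    gaps = sum(1 for a, b in zip(ordered, ordered[1:]) if b - a > min_gap)
    return gaps + 1


def approximate_k(data, estimator, sample_size=200, iterations=100, seed=0):
    """Apply `estimator` to small random subsets drawn with replacement and
    return (most frequent estimate, mean estimate) across the iterations."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(iterations):
        subset = rng.choices(data, k=sample_size)  # sampling with replacement
        estimates.append(estimator(subset))
    most_frequent = Counter(estimates).most_common(1)[0][0]
    return most_frequent, mean(estimates)


# Synthetic 1-D data (hypothetical): three well-separated groups of 10,000 points.
gen = random.Random(42)
data = [gen.gauss(center, 1.0)
        for center in (0.0, 50.0, 100.0)
        for _ in range(10_000)]

k_frequent, k_mean = approximate_k(data, count_groups_by_gaps)
print("most frequent estimate:", k_frequent)
print("mean estimate:", k_mean)
```

Each iteration only touches `sample_size` points, so the cost of the estimator does not grow with the size of the full dataset. Under this sketch's assumptions, that is the property that makes the approach attractive for large data. Any other cluster-number estimator with the same call shape could be passed in place of the toy rule.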


Updated: 2018-06-01
