Robust and sparse k-means clustering for high-dimensional data
Advances in Data Analysis and Classification (IF 1.6), Pub Date: 2019-03-19, DOI: 10.1007/s11634-019-00356-9
Šárka Brodinová, Peter Filzmoser, Thomas Ortner, Christian Breiteneder, Maia Rohm

In real-world applications, identifying groups is challenging because the data may contain both outliers and noise variables. There is therefore a need for a clustering method that can reveal the group structure in such data without any prior knowledge. In this paper, we propose a k-means-based algorithm incorporating a weighting function which automatically assigns a weight to each observation. To cope with noise variables, a lasso-type penalty is applied to an objective function adjusted by the observation weights. Finally, we introduce a framework for selecting both the number of clusters and the variables, based on a modified gap statistic. Experiments on simulated and real-world data demonstrate the advantage of the method in identifying groups, outliers, and informative variables simultaneously.
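
To make the interplay between observation weights and the lasso-type penalty more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard sparse k-means formulation, in which variable weights maximise a weighted between-cluster sum of squares subject to an L2 bound of 1 and an L1 bound `s`, solved by soft-thresholding. The abstract does not specify the weighting function, so `obs_w` is a hypothetical input supplied by the caller, and the function names, the bound `s`, and the toy data are illustrative assumptions.

```python
import numpy as np

def weighted_bcss(X, labels, obs_w):
    """Per-variable between-cluster sum of squares under observation weights.

    obs_w is a hypothetical vector of observation weights (small values
    would downweight suspected outliers); how the paper derives these
    weights is not stated in the abstract.
    """
    X = np.asarray(X, dtype=float)
    obs_w = np.asarray(obs_w, dtype=float)
    overall = np.average(X, axis=0, weights=obs_w)
    total = (obs_w[:, None] * (X - overall) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for k in np.unique(labels):
        mask = labels == k
        center = np.average(X[mask], axis=0, weights=obs_w[mask])
        within += (obs_w[mask, None] * (X[mask] - center) ** 2).sum(axis=0)
    return total - within

def variable_weights(bcss, s, n_iter=60):
    """Maximise sum_j w_j * bcss_j s.t. ||w||_2 <= 1, ||w||_1 <= s, w >= 0.

    Assumes s >= 1 and at least one strictly positive entry in bcss.
    The threshold is found by bisection on the soft-thresholding level.
    """
    a = np.maximum(np.asarray(bcss, dtype=float), 0.0)
    w = a / np.linalg.norm(a)
    if w.sum() <= s:
        return w  # L1 bound inactive, no thresholding needed
    best = np.zeros_like(a)
    best[np.argmax(a)] = 1.0  # always feasible when s >= 1
    lo, hi = 0.0, a.max()
    for _ in range(n_iter):
        delta = (lo + hi) / 2.0
        w = np.maximum(a - delta, 0.0)  # soft-thresholding of non-negative a
        w = w / np.linalg.norm(w)
        if w.sum() > s:
            lo = delta  # still too dense, threshold harder
        else:
            hi = delta
            best = w
    return best

# Toy data: variable 0 separates two groups, the other four are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
X[:20, 0] += 5.0
labels = np.repeat([0, 1], 20)
obs_w = np.ones(40)  # placeholder: every observation fully trusted

w = variable_weights(weighted_bcss(X, labels, obs_w), s=1.2)
print(np.round(w, 3))
```

In a full procedure this variable-weight step would alternate with a cluster assignment step and an observation-weight update, and the bound `s` together with the number of clusters would be chosen by a criterion such as the modified gap statistic the abstract mentions; none of those steps are shown here.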

Updated: 2019-03-19