A Bayesian non‐parametric approach for automatic clustering with feature weighting
Stat (IF 0.7), Pub Date: 2020-08-11, DOI: 10.1002/sta4.306
Debolina Paul, Swagatam Das

Feature weighting and feature selection, despite being well-known problems, remain a major predicament for clustering. Most algorithms that provide feature weighting or selection require the number of clusters to be known in advance. Conversely, the existing automatic clustering procedures that can determine the number of clusters are computationally expensive and often leave no room for feature weighting or selection. In this paper, we propose a Gibbs sampling-based algorithm for the Dirichlet process mixture model that can determine the number of clusters and also incorporate a near-optimal feature weighting. We show that, in the limiting case, the algorithm approaches a hard clustering procedure resembling minimization of an underlying clustering objective similar to weighted k-means with an additional penalty on the number of clusters, and hence retains the simplicity of Lloyd's heuristic. To avoid the trivial solution of the resulting linear program in the feature weights, we include an additional entropic penalty on the weights. The proposed algorithm is tested on several synthetic and real-life datasets. Through a detailed experimental analysis, we demonstrate the competitiveness of our proposal against baseline as well as state-of-the-art procedures for centre-based high-dimensional clustering.
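The hard-clustering limit sketched in the abstract can be illustrated with a short Lloyd-style loop: weighted k-means whose feature weights are updated in closed form under an entropic penalty. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function name and the temperature parameter `gamma` are hypothetical, the number of clusters is fixed here (the full method learns it through the Dirichlet process construction), and the weight update shown is the standard softmin that an entropy-regularised weight objective yields.

```python
import numpy as np

def entropy_weighted_kmeans(X, k, gamma=10.0, n_iter=50):
    """Illustrative sketch: Lloyd-style weighted k-means with an
    entropic penalty on the feature weights. Unlike the paper's
    method, k is fixed rather than inferred."""
    n, d = X.shape
    # deterministic farthest-point initialisation of the k centres
    centers = np.empty((k, d))
    centers[0] = X[0]
    for j in range(1, k):
        dmin = ((X[:, None] - centers[None, :j]) ** 2).sum(-1).min(1)
        centers[j] = X[dmin.argmax()]
    w = np.full(d, 1.0 / d)  # feature weights, kept on the simplex
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # assignment step: weighted squared Euclidean distance
        dist = (((X[:, None, :] - centers[None, :, :]) ** 2) * w).sum(-1)
        labels = dist.argmin(1)
        # update step: centres are per-cluster means
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
        # weight step: closed-form softmin of per-feature dispersion;
        # the entropy term prevents the trivial one-hot solution that
        # the unregularised linear program in the weights would give
        D = ((X - centers[labels]) ** 2).sum(0)
        w = np.exp(-D / gamma)
        w /= w.sum()
    return labels, centers, w
```

Features with low within-cluster dispersion receive exponentially larger weights, while `gamma` controls how far the solution stays from the degenerate one-hot weighting.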
