A minimum spanning tree based partitioning and merging technique for clustering heterogeneous data sets
Journal of Intelligent Information Systems (IF 2.3), Pub Date: 2020-04-22, DOI: 10.1007/s10844-020-00602-z
Gaurav Mishra, Sraban Kumar Mohanty

Clustering, an unsupervised learning technique, has been used extensively for knowledge discovery because it depends little on domain knowledge. Many clustering techniques have been proposed in the literature to recognize clusters with different characteristics. Most of them become inadequate either because they depend on user-defined parameters or because they are applied to multi-scale datasets. Hybrid clustering techniques have been proposed to take advantage of both partitional and hierarchical techniques by first partitioning the dataset into several dense sub-clusters and then merging them into the actual clusters. However, the partitioning and merging criteria are not universal enough to capture many characteristics of the clusters. The minimum spanning tree (MST) has been used extensively for clustering because it preserves the intrinsic nature of the dataset even after the graph is sparsified. In this paper, we propose a parameter-free, efficient hybrid clustering algorithm based on the minimum spanning tree to cluster multi-scale datasets. In the first phase, we construct an MST of the dataset to capture the neighborhood information of the data points and apply the box plot, an outlier detection technique, to the MST edge weights to select the inconsistent edges whose removal partitions the data points into several dense sub-clusters. In the second phase, we propose a novel merging criterion to find the genuine clusters by iteratively merging only pairs of adjacent sub-clusters. The merging technique considers both the dis-connectivity and the intra-similarity of a pair of adjacent sub-clusters, using their topology, which helps to identify clusters of arbitrary shape and varying density. Experimental results on various synthetic and real-world datasets demonstrate the superior performance of the proposed technique over other popular clustering algorithms.
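
The first phase described in the abstract (build an MST, then cut edges whose weights are box-plot outliers) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not state the exact box-plot rule, so the sketch assumes the conventional upper fence Q3 + 1.5 × IQR. The function names `mst_edges` and `boxplot_partition` and the `whisker` parameter are hypothetical. The second-phase merging criterion is not specified in the abstract and is not attempted here.

```python
import math
from statistics import quantiles


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns the MST as a list of (weight, u, v) tuples.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge linking each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree with the cheapest connecting edge.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def boxplot_partition(points, whisker=1.5):
    """Cut MST edges above the box-plot upper fence and label the components.

    ASSUMPTION: the fence is Q3 + whisker * IQR (the conventional box-plot
    rule); the paper may define inconsistent edges differently.
    """
    edges = mst_edges(points)
    q1, _, q3 = quantiles([w for w, _, _ in edges], n=4)
    fence = q3 + whisker * (q3 - q1)

    # Union-find over the edges that are kept (weight within the fence).
    root = list(range(len(points)))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for w, u, v in edges:
        if w <= fence:
            root[find(u)] = find(v)
    return [find(i) for i in range(len(points))]


if __name__ == "__main__":
    pts = [(float(x), 0.0) for x in range(5)] + [(100.0 + x, 0.0) for x in range(5)]
    print(boxplot_partition(pts))
```

Points that share a label belong to the same sub-cluster. Because the fence is derived from the edge-weight distribution itself, no user-supplied threshold is needed, which matches the parameter-free goal stated in the abstract. The `whisker` default is a convention of this sketch only.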

Updated: 2020-04-22