Clustering Categorical Data: A Survey, International Journal of Information Technology & Decision Making

International Journal of Information Technology & Decision Making (IF 2.5), Pub Date: 2019-12-10, DOI: 10.1142/s0219622019300064
Sami Naouali, Semeh Ben Salem, Zied Chtourou

Clustering is a complex unsupervised method that groups the most similar observations of a given dataset within the same cluster. To be efficient, a clustering process should achieve high accuracy at low complexity. Many clustering methods have been developed in various fields, depending on the type of application and the data type considered. Categorical clustering segments datasets whose data are categorical, and it has been widely used in many real-world applications. Several methods have therefore been developed, including hard, fuzzy and rough set-based methods. In this survey, more than 30 categorical clustering algorithms are investigated. These methods are classified into hierarchical and partitional clustering methods and compared in terms of their accuracy, precision and recall to identify the most prominent ones. Experimental results show that rough set-based clustering methods provided better efficiency than hard and fuzzy methods. In addition, methods based on the initialization of the centroids also provided good results.
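
As a rough illustration of what a hard partitional method for categorical data can look like, the sketch below implements a minimal k-modes loop in Python. The abstract does not name specific algorithms, so the choice of k-modes, the function names (`mismatch`, `column_modes`, `k_modes`) and the toy dataset are all assumptions made for this example, not material from the survey. The `init_modes` parameter is exposed because the abstract notes that centroid initialization matters.

```python
from collections import Counter


def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent category in each attribute of a non-empty list of rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, init_modes, max_iter=20):
    """Hard clustering of categorical rows around the given initial modes."""
    modes = [tuple(m) for m in init_modes]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each row joins its nearest mode (ties go to the lowest index).
        labels = [
            min(range(len(modes)), key=lambda j: mismatch(row, modes[j]))
            for row in data
        ]
        # Update step: recompute each mode; an empty cluster keeps its old mode.
        new_modes = []
        for j, old in enumerate(modes):
            members = [row for row, lab in zip(data, labels) if lab == j]
            new_modes.append(column_modes(members) if members else old)
        if new_modes == modes:  # converged: modes stopped changing
            break
        modes = new_modes
    return labels, modes


# Hypothetical toy dataset: (colour, size, shape).
data = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "medium", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
    ("green", "large", "square"),
]
labels, modes = k_modes(data, init_modes=[data[0], data[3]])
print(labels)  # cluster index for each row
print(modes)   # one representative mode per cluster
```

Different choices of `init_modes` can lead to different partitions, which is one reason initialization-focused variants exist; this sketch simply takes the initial modes as an argument instead of prescribing a strategy.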

Updated: 2019-12-10