A dissimilarity measure for mixed nominal and ordinal attribute data in k-Modes algorithm
Applied Intelligence (IF 3.4), Pub Date: 2020-01-25, DOI: 10.1007/s10489-019-01583-5
Fang Yuan, Youlong Yang, Tiantian Yuan

Among existing clustering algorithms, k-Means is one of the most commonly used. As an extension of k-Means, the k-Modes algorithm has been widely applied to categorical data clustering by replacing means with modes. However, mixed-type data containing categorical, ordinal and numerical attributes are more common. The mixed-type data clustering problem has recently attracted much attention from the data mining research community, but most existing methods overlook ordinal attributes and do not establish an explicit similarity metric for them. In this paper, the limitations of some existing dissimilarity measures for the k-Modes algorithm on mixed ordinal and nominal data are analyzed using illustrative examples. Based on the idea of mining the order information of ordinal attributes, a new dissimilarity measure for the k-Modes algorithm is proposed for clustering this type of data. The distinguishing characteristic of the new measure is that it takes the order information of ordinal attributes into account. A convergence study and a time complexity analysis of the k-Modes algorithm based on this new dissimilarity measure indicate that it can be used effectively on large data sets. Comparative experiments on nine real data sets from UCI show the effectiveness of the new dissimilarity measure.
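
The abstract does not give the formula of the proposed measure, so the following is only a minimal sketch of the general idea it describes: nominal attributes are compared by simple matching, as in standard k-Modes, while ordinal attributes contribute a distance that depends on how far apart their levels are. The rank-gap formula, the function names `mixed_dissimilarity` and `nearest_mode`, and the `ordinal_levels` parameter are all assumptions made for illustration and are not taken from the paper.

```python
def mixed_dissimilarity(x, y, ordinal_levels):
    """Sum per-attribute dissimilarities between two records.

    ordinal_levels maps an attribute index to its ordered list of levels;
    attributes missing from the map are treated as nominal.
    """
    total = 0.0
    for j, (a, b) in enumerate(zip(x, y)):
        levels = ordinal_levels.get(j)
        if levels is None or len(levels) < 2:
            # Nominal attribute: simple matching (0 if equal, 1 otherwise).
            total += 0.0 if a == b else 1.0
        else:
            # Ordinal attribute (assumed form, not the paper's formula):
            # gap between level positions, scaled to the range [0, 1].
            gap = abs(levels.index(a) - levels.index(b))
            total += gap / (len(levels) - 1)
    return total


def nearest_mode(record, modes, ordinal_levels):
    """Return the index of the cluster mode closest to the record."""
    return min(
        range(len(modes)),
        key=lambda c: mixed_dissimilarity(record, modes[c], ordinal_levels),
    )


# Attribute 0 is nominal (colour); attribute 1 is ordinal (low < medium < high).
levels = {1: ["low", "medium", "high"]}
modes = [("red", "low"), ("blue", "high")]
print(mixed_dissimilarity(("red", "low"), ("red", "medium"), levels))
print(mixed_dissimilarity(("red", "low"), ("red", "high"), levels))
print(nearest_mode(("red", "medium"), modes, levels))
```

Under simple matching alone, "low" would be equally far from "medium" and "high"; in this sketch the ordinal term separates the two cases, which is the kind of order information the abstract says existing measures ignore.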




Updated: 2020-04-20