Clustering of modal-valued symbolic data, Advances in Data Analysis and Classification


Clustering of modal-valued symbolic data
Advances in Data Analysis and Classification (IF 1.6), Pub Date: 2020-10-24, DOI: 10.1007/s11634-020-00425-4
Nataša Kejžar, Simona Korenjak-Černe, Vladimir Batagelj

Symbolic data analysis is based on special descriptions of data known as symbolic objects (SOs). Such descriptions preserve more detailed information about units and their clusters than the usual representations with mean values. A special type of SO is a representation with frequency or probability distributions (modal values). This representation enables us to consider variables of all measurement types simultaneously during the clustering process. In this paper, we present the theoretical basis for compatible leaders and agglomerative clustering methods with alternative dissimilarities for modal-valued SOs. The leaders method efficiently solves clustering problems with large numbers of units, while the agglomerative method can be applied either alone to a small data set or to the leaders obtained from the compatible leaders clustering method. We focus on (a) the inclusion of weights, which enables cluster representatives to retain the same structure as if only first-order units were clustered, and (b) the selection of relative dissimilarities that produce more interpretable, i.e., more meaningful, optimal cluster representatives. The usefulness of the proposed methods and their adaptations was assessed on carefully constructed simulation settings and demonstrated on three real-world data sets, where interpretability improved through the use of weights (population pyramids and ESS data) or a relative dissimilarity (US patents data).
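
To make the role of weights in the leaders method concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: each unit is a single probability distribution with one weight (for example, the number of first-order units it summarises), the dissimilarity is the plain squared Euclidean distance, and the names `sq_dissimilarity`, `weighted_leader`, and `leaders_clustering` are hypothetical. For this particular dissimilarity, the leader that minimises the weighted error of a cluster is the weighted mean of its members, which is the distribution obtained by pooling the underlying first-order units.

```python
from typing import List, Sequence, Tuple

Dist = Sequence[float]  # one modal value: a probability distribution


def sq_dissimilarity(p: Dist, t: Dist) -> float:
    """Squared Euclidean dissimilarity between a unit and a leader (assumed choice)."""
    return sum((pi - ti) ** 2 for pi, ti in zip(p, t))


def weighted_leader(dists: Sequence[Dist], weights: Sequence[float]) -> List[float]:
    """Weighted mean of the member distributions.

    With weights equal to the number of first-order units behind each
    distribution, this equals the distribution of the pooled units.
    """
    total = sum(weights)
    k = len(dists[0])
    return [sum(w * p[i] for p, w in zip(dists, weights)) / total for i in range(k)]


def leaders_clustering(
    dists: Sequence[Dist],
    weights: Sequence[float],
    init_leaders: Sequence[Dist],
    max_iter: int = 100,
) -> Tuple[List[int], List[List[float]]]:
    """Alternate assignment and leader update until the partition is stable."""
    leaders = [list(t) for t in init_leaders]
    labels = [-1] * len(dists)
    for _ in range(max_iter):
        # Assignment step: each unit goes to its nearest leader.
        new_labels = [
            min(range(len(leaders)), key=lambda c: sq_dissimilarity(p, leaders[c]))
            for p in dists
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: recompute each non-empty cluster's leader.
        for c in range(len(leaders)):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if members:
                leaders[c] = weighted_leader(
                    [dists[i] for i in members], [weights[i] for i in members]
                )
    return labels, leaders


# Toy data (made up for illustration): four distributions over three categories.
dists = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.1, 0.7]]
weights = [100, 300, 50, 150]  # e.g. sizes of the summarised groups
labels, leaders = leaders_clustering(dists, weights, [dists[0], dists[2]])
print(labels)
print(leaders)
```

Because the heavier unit in each cluster pulls the leader towards itself, the leader stays a valid distribution that describes the pooled population rather than an unweighted average of group profiles. The leaders produced this way could then be passed to an agglomerative step, as the abstract describes.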


Updated: 2020-10-30
