Hierarchical Qualitative Clustering: clustering mixed datasets with critical qualitative information


arXiv - CS - Machine Learning, Pub Date: 2020-06-30, DOI: arxiv-2006.16701
Diogo Seca, João Mendes-Moreira, Tiago Mendes-Neves, Ricardo Sousa

Clustering can be used to extract insights from data or to verify assumptions held by domain experts, namely about data segmentation. Few methods in the literature can cluster qualitative values using the context provided by the other variables in the data without losing interpretability. Moreover, the metrics for calculating dissimilarity between qualitative values often scale poorly to high-dimensional mixed datasets. In this study, we propose Hierarchical Qualitative Clustering (HQC), a novel method for clustering qualitative values that is based on Hierarchical Clustering and uses Maximum Mean Discrepancy. HQC maintains the original interpretability of the qualitative information present in the dataset. We apply HQC to two datasets. Using a mixed dataset provided by Spotify, we showcase how our method can be used for clustering music artists based on the quantitative features of thousands of songs. In addition, using financial features of companies, we cluster company industries and discuss the implications for investment portfolio diversification.
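
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the kernel, the estimator, or the linkage, so the Gaussian kernel, the biased Maximum Mean Discrepancy estimate, the average linkage, and all function and parameter names below (`rbf_kernel`, `mmd2`, `cluster_qualitative_values`, `gamma`, `n_clusters`) are assumptions made for illustration. Each qualitative value (for example, an artist) is represented by the quantitative feature rows that carry it, pairwise dissimilarities between values are estimated with MMD, and the resulting matrix is passed to a standard hierarchical clustering routine.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def rbf_kernel(a, b, gamma):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point error.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def mmd2(x, y, gamma=1.0):
    """Biased estimate of the squared MMD between samples x and y."""
    return (
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )


def cluster_qualitative_values(features, labels, n_clusters, gamma=1.0):
    """Group qualitative values by the distribution of their quantitative rows.

    features: (n_rows, n_features) array of quantitative variables.
    labels:   length n_rows sequence holding the qualitative value of each row.
    Returns a dict mapping each qualitative value to a cluster id.
    """
    values = sorted(set(labels))
    labels = np.asarray(labels)
    groups = [features[labels == v] for v in values]

    # Pairwise MMD between the samples of every two qualitative values.
    k = len(values)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            d = np.sqrt(max(mmd2(groups[i], groups[j], gamma), 0.0))
            dist[i, j] = dist[j, i] = d

    # Assumption: average linkage; the abstract does not name a linkage.
    tree = linkage(squareform(dist), method="average")
    assignment = fcluster(tree, t=n_clusters, criterion="maxclust")
    return {v: int(c) for v, c in zip(values, assignment)}
```

Because the clusters are sets of the original qualitative values, the output stays directly readable, which is the interpretability property the abstract emphasizes. The quadratic cost of the kernel matrices means that a sketch like this would need subsampling or an approximation for groups with many rows.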


Updated: 2020-07-07
