Hierarchical clustering with discrete latent variable models and the integrated classification likelihood, Advances in Data Analysis and Classification



Advances in Data Analysis and Classification (IF 1.4) Pub Date: 2021-04-13, DOI: 10.1007/s11634-021-00440-z
Etienne Côme, Nicolas Jouvin, Pierre Latouche, Charles Bouveyron

Finding a set of nested partitions of a dataset is useful to uncover relevant structure at different scales, and is often dealt with using a data-dependent methodology. In this paper, we introduce a general two-step methodology for model-based hierarchical clustering. Taking the integrated classification likelihood criterion as the objective function, this work applies to every discrete latent variable model (DLVM) for which this quantity is tractable. The first step of the methodology involves maximizing the criterion with respect to the partition. To address the known problem of sub-optimal local maxima found by greedy hill-climbing heuristics, we introduce a new hybrid algorithm based on a genetic algorithm that efficiently explores the space of solutions. The resulting algorithm carefully combines and merges different solutions, and allows the joint inference of the number K of clusters as well as the clusters themselves. Starting from this natural partition, the second step of the methodology uses a bottom-up greedy procedure to extract a hierarchy of clusters. In a Bayesian context, this is achieved by treating the Dirichlet cluster proportion prior parameter \(\alpha\) as a regularization term controlling the granularity of the clustering. A new approximation of the criterion is derived as a log-linear function of \(\alpha\), enabling a simple functional form of the merge decision criterion. This second step allows the exploration of the clustering at coarser scales. The proposed approach is compared with existing strategies on simulated as well as real settings, and its results are shown to be particularly relevant. A reference implementation of this work is available in the R package greed accompanying the paper.
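To make the two ingredients of the abstract concrete, the following is a minimal Python sketch, not the authors' algorithm and not the API of the greed package. It assumes one particular DLVM (a mixture of multinomials with symmetric Dirichlet priors, for which the integrated classification likelihood has a closed form up to a constant that does not depend on the partition) and performs a plain bottom-up greedy merge at a fixed \(\alpha\). It does not implement the genetic search of the first step or the log-linear approximation in \(\alpha\) of the second. All function names, the `beta` parameter, and the toy data are hypothetical.

```python
import math
from itertools import combinations


def log_dirichlet_multinomial(counts, conc):
    """Log marginal probability of a sequence with these category counts
    under a symmetric Dirichlet(conc) prior on the category probabilities."""
    d = len(counts)
    total = sum(counts)
    return (math.lgamma(d * conc) - d * math.lgamma(conc)
            + sum(math.lgamma(c + conc) for c in counts)
            - math.lgamma(total + d * conc))


def exact_icl(rows, labels, alpha=1.0, beta=1.0):
    """log p(X, Z) for a multinomial mixture, up to a constant independent of Z.
    Assumption: K is taken to be the number of non-empty clusters."""
    clusters = sorted(set(labels))
    sizes = [labels.count(k) for k in clusters]
    icl = log_dirichlet_multinomial(sizes, alpha)  # log p(Z | alpha)
    for k in clusters:
        members = (r for r, z in zip(rows, labels) if z == k)
        aggregated = [sum(col) for col in zip(*members)]
        icl += log_dirichlet_multinomial(aggregated, beta)  # log p(X_k | beta)
    return icl


def greedy_merge_hierarchy(rows, labels, alpha=1.0, beta=1.0):
    """Repeatedly apply the merge with the highest resulting ICL (fixed alpha),
    recording each partition and its score until one cluster remains."""
    labels = list(labels)
    history = [(list(labels), exact_icl(rows, labels, alpha, beta))]
    while len(set(labels)) > 1:
        best = None
        for a, b in combinations(sorted(set(labels)), 2):
            merged = [a if z == b else z for z in labels]
            score = exact_icl(rows, merged, alpha, beta)
            if best is None or score > best[1]:
                best = (merged, score)
        labels = best[0]
        history.append((list(labels), best[1]))
    return history


# Hypothetical toy data: six count vectors over three categories.
rows = [[5, 0, 0], [4, 1, 0], [0, 5, 0], [0, 4, 1], [0, 0, 5], [1, 0, 4]]
history = greedy_merge_hierarchy(rows, labels=[0, 1, 2, 3, 4, 5])
for partition, score in history:
    print(len(set(partition)), partition, round(score, 3))
```

Each entry of `history` is a coarser partition nested in the previous one, which is the kind of hierarchy the second step extracts. In the paper the merges are driven by varying \(\alpha\) rather than by a fixed-\(\alpha\) comparison as in this sketch.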


Updated: 2021-04-14
