Greedy clustering of count data through a mixture of multinomial PCA, Computational Statistics


Greedy clustering of count data through a mixture of multinomial PCA
Computational Statistics (IF 1.0) Pub Date: 2020-07-08, DOI: 10.1007/s00180-020-01008-9
Nicolas Jouvin, Pierre Latouche, Charles Bouveyron, Guillaume Bataillon, Alain Livartowski

Count data is becoming increasingly common in a wide range of applications, with datasets growing both in size and in dimension. In this context, an increasing amount of work is dedicated to the construction of statistical models that directly account for the discrete nature of the data. Moreover, it has been shown that integrating dimension reduction into clustering can drastically improve performance and stability. In this paper, we rely on the mixture of multinomial PCA, a mixture model for the clustering of count data, also known in the literature as the probabilistic clustering-projection model. Related to the latent Dirichlet allocation model, it offers the flexibility of topic modeling while being able to assign each observation to a unique cluster. We introduce a greedy clustering algorithm in which inference and clustering are done jointly, by combining a classification variational expectation maximization algorithm with a branch-and-bound-like strategy on a variational lower bound. An integrated classification likelihood criterion is derived for model selection, and a thorough study with numerical experiments is carried out to assess both the performance and the robustness of the method. Finally, we illustrate the qualitative interest of the method in a real-world application, the clustering of anatomopathological medical reports, in partnership with expert practitioners from the Institut Curie hospital.
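To make the model described in the abstract more concrete, here is a minimal generative sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: it assumes each cluster has its own topic-proportion vector drawn from a Dirichlet prior, that a shared topic matrix maps topics to categories, and that each observation's counts are multinomial with probabilities given by the product of the two. The function name `sample_mmpca`, the uniform Dirichlet priors, the uniform cluster weights, and the fixed total count per observation are all hypothetical choices made for this example. The greedy inference procedure and the model selection criterion are not shown.

```python
import numpy as np


def sample_mmpca(n_obs, n_clusters, n_topics, vocab_size, total_count, seed=0):
    """Draw synthetic count data from an assumed mixture-of-multinomial-PCA model.

    All names and prior choices here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Assumed uniform mixing proportions over clusters.
    pi = np.full(n_clusters, 1.0 / n_clusters)

    # Topic matrix: one probability vector over categories per topic
    # (shape: vocab_size x n_topics, each column sums to 1).
    beta = rng.dirichlet(np.ones(vocab_size), size=n_topics).T

    # One topic-proportion vector per cluster (shape: n_clusters x n_topics).
    theta = rng.dirichlet(np.ones(n_topics), size=n_clusters)

    # Each observation belongs to exactly one cluster.
    labels = rng.choice(n_clusters, size=n_obs, p=pi)

    counts = np.empty((n_obs, vocab_size), dtype=np.int64)
    for i, q in enumerate(labels):
        # Category probabilities: cluster-specific mixture of the topics.
        probs = beta @ theta[q]
        probs = probs / probs.sum()  # guard against floating-point drift
        counts[i] = rng.multinomial(total_count, probs)

    return counts, labels, theta, beta


counts, labels, theta, beta = sample_mmpca(
    n_obs=20, n_clusters=3, n_topics=4, vocab_size=50, total_count=100
)
```

In this sketch, all observations in the same cluster share one topic-proportion vector, which is what lets the model combine topic modeling with a hard assignment of each observation to a single cluster.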


Updated: 2020-07-09
