Deep mixtures of unigrams for uncovering topics in textual data, Statistics and Computing

Deep mixtures of unigrams for uncovering topics in textual data
Statistics and Computing (IF 2.2), Pub Date: 2021-03-03, DOI: 10.1007/s11222-020-09989-9
Cinzia Viroli, Laura Anderlucci

Mixtures of unigrams are one of the simplest and most efficient tools for clustering textual data, as they assume that documents related to the same topic have similar distributions of terms, naturally described by multinomials. When the classification task is particularly challenging, such as when the document-term matrix is high-dimensional and extremely sparse, a more composite representation can provide better insight into the grouping structure. In this work, we developed a deep version of mixtures of unigrams for the unsupervised classification of very short documents with a large number of terms, by allowing for models with further deeper latent layers; the proposal is derived in a Bayesian framework. The behavior of the deep mixtures of unigrams is empirically compared with that of other traditional and state-of-the-art methods, namely k-means with cosine distance, k-means with Euclidean distance on data transformed according to semantic analysis, partition around medoids, mixture of Gaussians on semantic-based transformed data, hierarchical clustering according to Ward’s method with cosine dissimilarity, latent Dirichlet allocation, mixtures of unigrams estimated via the EM algorithm, spectral clustering and affinity propagation clustering. The performance is evaluated in terms of both correct classification rate and Adjusted Rand Index. Simulation studies and real data analysis prove that going deep in clustering such data highly improves the classification accuracy.
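
For orientation, the following is a minimal sketch of the plain, single-layer mixture of unigrams fitted by EM, one of the baselines named in the abstract. It is not the deep Bayesian model the paper proposes. The function names, the additive smoothing constant `alpha`, the random initialisation, and the toy document-term counts are all assumptions made for illustration. An Adjusted Rand Index helper is included because the abstract uses that measure for evaluation.

```python
import math
import random
from collections import Counter


def fit_mixture_of_unigrams(counts, k, n_iter=50, alpha=1.0, seed=0):
    """EM for a mixture of multinomials over term counts.

    counts: list of documents, each a list of term counts of equal length.
    alpha:  additive smoothing on term probabilities (an assumption here,
            used to avoid log(0) on sparse data).
    Returns (weights, term_probs, labels).
    """
    rng = random.Random(seed)
    n, v = len(counts), len(counts[0])
    eps = 1e-9  # keeps mixing weights strictly positive
    # Random soft initialisation of the responsibilities.
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])
    for _ in range(n_iter):
        # M-step: update mixing weights and per-cluster term probabilities.
        weights = [(sum(resp[i][j] for i in range(n)) + eps) / (n + k * eps)
                   for j in range(k)]
        term_probs = []
        for j in range(k):
            raw = [alpha + sum(resp[i][j] * counts[i][t] for i in range(n))
                   for t in range(v)]
            total = sum(raw)
            term_probs.append([x / total for x in raw])
        # E-step: posterior cluster probabilities, computed in log space.
        # The multinomial coefficient is constant across clusters and omitted.
        for i in range(n):
            logp = [math.log(weights[j])
                    + sum(c * math.log(term_probs[j][t])
                          for t, c in enumerate(counts[i]) if c)
                    for j in range(k)]
            top = max(logp)
            unnorm = [math.exp(x - top) for x in logp]
            total = sum(unnorm)
            resp[i] = [x / total for x in unnorm]
    labels = [max(range(k), key=lambda j: resp[i][j]) for i in range(n)]
    return weights, term_probs, labels


def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(a)
    sum_ij = sum(math.comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(math.comb(c, 2) for c in Counter(a).values())
    sum_b = sum(math.comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / math.comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Toy document-term matrix (hypothetical): two topics with disjoint vocabularies.
toy_counts = [
    [5, 4, 0, 0], [6, 3, 0, 0], [4, 5, 0, 0],
    [0, 0, 5, 4], [0, 0, 3, 6], [0, 0, 4, 4],
]
toy_truth = [0, 0, 0, 1, 1, 1]
weights, term_probs, labels = fit_mixture_of_unigrams(toy_counts, k=2)
print(labels, adjusted_rand_index(toy_truth, labels))
```

Cluster labels from EM are arbitrary, so the Adjusted Rand Index, which is invariant to label permutation, is the safer comparison against a reference partition than raw label equality.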

Updated: 2021-03-03