Textual data summarization using the Self-Organized Co-Clustering model, Pattern Recognition



Textual data summarization using the Self-Organized Co-Clustering model
Pattern Recognition (IF 7.5), Pub Date: 2020-07-01, DOI: 10.1016/j.patcog.2020.107315
Margot Selosse, Julien Jacques, Christophe Biernacki

Recently, different studies have demonstrated the use of co-clustering, a data mining technique that simultaneously produces row-clusters of observations and column-clusters of features. The present work introduces a novel co-clustering model to easily summarize textual data in a document-term format. In addition to highlighting homogeneous co-clusters, as other existing algorithms do, we also distinguish noisy co-clusters from significant co-clusters, which is particularly useful for sparse document-term matrices. Furthermore, our model proposes a structure among the significant co-clusters, thus providing improved interpretability to users. The proposed approach competes with state-of-the-art methods for document and term clustering and offers user-friendly results. The model relies on the Poisson distribution and on a constrained version of the Latent Block Model, which is a probabilistic approach for co-clustering. A Stochastic Expectation-Maximization algorithm is proposed to run the model’s inference, as well as a model selection criterion to choose the number of co-clusters. Both simulated and real data sets illustrate the efficiency of this model by its ability to easily identify relevant co-clusters.
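To make the idea of a Poisson block structure on a document-term matrix more concrete, the following is a minimal sketch and not the authors' model or inference procedure. It assumes the simplest unconstrained form, where each count is drawn as Poisson with a rate that depends only on the row-cluster and column-cluster of the cell. The function names, the rate values, and the low-rate third column-cluster (a stand-in for uninformative terms) are all illustrative assumptions. The partitions are taken as known here; the Stochastic Expectation-Maximization step that would estimate them is not shown.

```python
import numpy as np


def simulate_poisson_blocks(row_labels, col_labels, rates, rng):
    """Draw counts x[i, j] ~ Poisson(rates[row_labels[i], col_labels[j]]).

    Hypothetical helper: an unconstrained Poisson block model, used only
    to illustrate the kind of data a co-clustering model works on.
    """
    lam = rates[np.ix_(row_labels, col_labels)]  # per-cell Poisson rate
    return rng.poisson(lam)


def block_means(x, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Summarize a matrix by the mean count inside each co-cluster."""
    summary = np.zeros((n_row_clusters, n_col_clusters))
    for k in range(n_row_clusters):
        for l in range(n_col_clusters):
            block = x[np.ix_(row_labels == k, col_labels == l)]
            summary[k, l] = block.mean() if block.size else 0.0
    return summary


rng = np.random.default_rng(0)
row_labels = np.repeat([0, 1], 50)      # 100 documents in 2 row-clusters
col_labels = np.repeat([0, 1, 2], 40)   # 120 terms in 3 column-clusters

# Assumed rates: each row-cluster has one "specific" column-cluster with a
# high rate; the third column-cluster is low everywhere (illustrative only).
rates = np.array([[3.0, 0.1, 0.1],
                  [0.1, 3.0, 0.1]])

x = simulate_poisson_blocks(row_labels, col_labels, rates, rng)
summary = block_means(x, row_labels, col_labels, 2, 3)
print(np.round(summary, 2))
```

The printed 2 x 3 table of block means is the kind of compact summary a co-clustering provides: a sparse 100 x 120 count matrix is reduced to a handful of block-level rates, and blocks with uniformly low rates stand out from those that characterize a group of documents.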


Updated: 2020-07-01
