Topic discovery by spectral decomposition and clustering with coordinated global and local contexts
International Journal of Machine Learning and Cybernetics (IF 3.1) Pub Date: 2020-05-16, DOI: 10.1007/s13042-020-01133-3
Jian Wang, Kejing He, Min Yang

Topic modeling is an active research field due to its broad applications, such as information retrieval, opinion extraction, and authorship identification. It aims to discover topic structures from a collection of documents. Significant progress has been made by latent Dirichlet allocation (LDA) and its variants. However, conventional methods usually make the “bag-of-words” assumption for the whole document, which ignores the semantics of local context, even though local context plays a crucial role in topic modeling and document understanding. In this paper, we propose a novel coordinated embedding topic model (CETM), which combines spectral decomposition with a clustering technique and leverages both global and local context information to discover topics. In particular, CETM learns coordinated embeddings by spectral decomposition, capturing semantic relations between words effectively. To infer the topic distribution, we employ a clustering algorithm to capture the semantic centroids of the coordinated embeddings and derive a fast algorithm to obtain the topic structures. We conduct extensive experiments on three real-world datasets to evaluate the effectiveness of CETM. Quantitatively, compared with state-of-the-art topic modeling approaches, CETM achieves significantly better performance in terms of topic coherence and text classification. Qualitatively, CETM learns more coherent topics and more accurate word distributions for each topic.
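The abstract gives only the outline of the pipeline: spectral decomposition of global and local context, followed by clustering. The sketch below is one possible reading of that outline and is not the authors' CETM algorithm. Everything specific in it is an assumption made for illustration: PPMI weighting of the co-occurrence counts, truncated SVD as the spectral step, concatenation as a stand-in for "coordinating" the two embeddings, a plain k-means loop, the toy corpus, and all function names.

```python
import numpy as np

def cooccurrence(docs, vocab, window=None):
    """Symmetric co-occurrence counts: whole document if window is None, else a sliding window."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for doc in docs:
        ids = [index[w] for w in doc if w in index]
        for pos, i in enumerate(ids):
            hi = len(ids) if window is None else min(len(ids), pos + window + 1)
            for j in ids[pos + 1:hi]:
                if i != j:
                    counts[i, j] += 1
                    counts[j, i] += 1
    return counts

def ppmi(counts):
    """Positive pointwise mutual information (an assumed weighting, not taken from the paper)."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts give -inf or nan; treat them as no association
    return np.maximum(pmi, 0.0)

def spectral_embed(matrix, dim):
    """Truncated SVD: keep the top `dim` singular directions as word embeddings."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns centroids and the cluster label of each point."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # leave an empty cluster's centroid unchanged
                centroids[c] = points[labels == c].mean(axis=0)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)

def discover_topics(docs, k, dim=2, window=2):
    """Embed words from global and local context, then group them into k word clusters."""
    vocab = sorted({w for doc in docs for w in doc})
    global_emb = spectral_embed(ppmi(cooccurrence(docs, vocab)), dim)
    local_emb = spectral_embed(ppmi(cooccurrence(docs, vocab, window)), dim)
    emb = np.hstack([global_emb, local_emb])  # assumed stand-in for "coordination"
    _, labels = kmeans(emb, k)
    topics = [[w for w, lab in zip(vocab, labels) if lab == c] for c in range(k)]
    return vocab, emb, topics

# Hypothetical toy corpus, for illustration only.
docs = [
    ["cat", "dog", "pet", "dog", "cat"],
    ["dog", "pet", "cat", "pet"],
    ["stock", "market", "trade", "stock"],
    ["market", "trade", "stock", "trade"],
]
vocab, emb, topics = discover_topics(docs, k=2)
print(topics)
```

Each cluster here is only a hard grouping of words. The abstract says CETM also derives per-topic word distributions, which this sketch does not attempt.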




Updated: 2020-05-16