SenU-PTM: a novel phrase-based topic model for short-text topic discovery by exploiting word embeddings
Data Technologies and Applications (IF 1.6), Pub Date: 2021-04-29, DOI: 10.1108/dta-02-2021-0039
Heng-Yang Lu , Yi Zhang , Yuntao Du

Purpose

Topic models have been widely applied to discover important information from vast amounts of unstructured data. Traditional long-text topic models such as Latent Dirichlet Allocation may suffer from the sparsity problem when dealing with short texts, which mostly come from the Web. These models also suffer from a readability problem when displaying the discovered topics. The purpose of this paper is to propose a novel model, the Sense Unit based Phrase Topic Model (SenU-PTM), that addresses both the sparsity and readability problems.

Design/methodology/approach

SenU-PTM is a novel phrase-based short-text topic model built on a two-phase framework. The first phase introduces a phrase-generation algorithm that exploits word embeddings to generate phrases from the original corpus. The second phase introduces the concept of a sense unit, a set of semantically similar tokens, and models topics on these units using the token vectors generated in the first phase. Finally, SenU-PTM infers topics based on these two phases.
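
The abstract does not specify how sense units are constructed, so the following is only a minimal sketch of the general idea of grouping semantically similar tokens by their vectors. It is not the authors' algorithm: the greedy grouping strategy, the function names `cosine` and `build_sense_units`, the `threshold` parameter, and the toy vectors are all assumptions made for illustration. Tokens joined by an underscore stand in for phrases that a first phase might have produced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_sense_units(token_vectors, threshold=0.8):
    """Hypothetical greedy grouping, not the paper's procedure.

    Each token joins the first existing unit whose seed vector has cosine
    similarity >= threshold with the token's vector; otherwise the token
    starts a new unit and becomes its seed.
    """
    units = []  # list of (seed_vector, member_tokens)
    for token, vec in token_vectors.items():
        for seed, members in units:
            if cosine(seed, vec) >= threshold:
                members.append(token)
                break
        else:
            # No sufficiently similar unit found: start a new one.
            units.append((vec, [token]))
    return [members for _, members in units]


# Toy 3-dimensional vectors, invented for this example only.
toy_vectors = {
    "machine_learning": [0.9, 0.1, 0.0],
    "deep_learning": [0.85, 0.2, 0.0],
    "football": [0.0, 0.1, 0.95],
    "soccer": [0.05, 0.0, 0.9],
}

sense_units = build_sense_units(toy_vectors, threshold=0.8)
print(sense_units)
```

In this toy setting, a topic model would then count occurrences of whole units rather than individual tokens, which is one way that grouping could reduce sparsity in short texts. How SenU-PTM actually forms and uses its sense units is described in the full paper, not in this abstract.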

Findings

Experimental results on two real-world, publicly available datasets show the effectiveness of SenU-PTM in terms of topical quality and document characterization. The results indicate that modeling topics on sense units can address the sparsity of short texts while improving the readability of topics.

Originality/value

The originality of SenU-PTM lies in the new procedure of modeling topics on the proposed sense units with word embeddings for short-text topic discovery.




Updated: 2021-04-29