SenU-PTM: a novel phrase-based topic model for short-text topic discovery by exploiting word embeddings

Heng-Yang Lu (Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University, Wuxi, China) (School of Internet of Things Engineering, Jiangnan University, Wuxi, China) (National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China)
Yi Zhang (National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China)
Yuntao Du (National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China)

Data Technologies and Applications

ISSN: 2514-9288

Article publication date: 29 April 2021

Issue publication date: 11 October 2021

Abstract

Purpose

Topic models have been widely applied to discover important information from vast amounts of unstructured data. Traditional long-text topic models such as Latent Dirichlet Allocation may suffer from the sparsity problem when dealing with short texts, which mostly come from the Web. These models also suffer from a readability problem when displaying the discovered topics. The purpose of this paper is to propose a novel model, the Sense Unit based Phrase Topic Model (SenU-PTM), that addresses both the sparsity and readability problems.

Design/methodology/approach

SenU-PTM is a novel phrase-based short-text topic model built on a two-phase framework. The first phase introduces a phrase-generation algorithm that exploits word embeddings to generate phrases from the original corpus. The second phase introduces the new concept of a sense unit, which consists of a set of semantically similar tokens, and models topics on these units using the token vectors generated in the first phase. Finally, SenU-PTM infers topics based on the output of these two phases.
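The idea of a sense unit as a set of semantically similar tokens can be illustrated with a minimal sketch. The abstract does not specify how sense units are constructed, so everything here is an assumption made for illustration: the greedy grouping rule, the cosine-similarity threshold of 0.8, the function names `cosine` and `build_sense_units`, and the hand-made three-dimensional toy vectors. It should not be read as the authors' algorithm.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_sense_units(token_vectors, threshold=0.8):
    """Greedily group tokens whose vectors are close to a unit's first token.

    Hypothetical grouping rule, assumed for illustration only: a token joins
    the first existing unit whose seed token is at least `threshold` similar,
    otherwise it starts a new unit.
    """
    units = []
    for token, vec in token_vectors.items():
        for unit in units:
            seed_vec = token_vectors[unit[0]]
            if cosine(vec, seed_vec) >= threshold:
                unit.append(token)
                break
        else:
            # No sufficiently similar unit was found.
            units.append([token])
    return units


# Toy token vectors (phrases joined with underscores); not real embeddings.
toy_vectors = {
    "machine_learning": [0.9, 0.1, 0.0],
    "deep_learning": [0.8, 0.2, 0.1],
    "stock_market": [0.0, 0.9, 0.3],
    "share_price": [0.1, 0.8, 0.4],
}

units = build_sense_units(toy_vectors)
for unit in units:
    print(unit)
```

With these toy vectors, the two learning-related phrases fall into one unit and the two finance-related phrases into another. Grouping tokens this way lets sparse short texts share statistical strength across tokens that mean roughly the same thing.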

Findings

Experimental results on two real-world, publicly available datasets show the effectiveness of SenU-PTM in terms of topical quality and document characterization. The results reveal that modeling topics on sense units can solve the sparsity problem of short texts and improve the readability of topics at the same time.

Originality/value

The originality of SenU-PTM lies in the new procedure of modeling topics on the proposed sense units with word embeddings for short-text topic discovery.

Acknowledgements

This research was funded by the National Natural Science Foundation of China [Grant No. 62002137], the Fundamental Research Funds for the Central Universities [No. JUSRP12021] and the State Key Lab. for Novel Software Technology, Nanjing University, P.R. China [No. KFKT2020B02].

Citation

Lu, H.-Y., Zhang, Y. and Du, Y. (2021), "SenU-PTM: a novel phrase-based topic model for short-text topic discovery by exploiting word embeddings", Data Technologies and Applications, Vol. 55 No. 5, pp. 643-660. https://doi.org/10.1108/DTA-02-2021-0039

Publisher

Emerald Publishing Limited

Copyright © 2021, Emerald Publishing Limited
