A Pseudo-document-based Topical N-grams model for short texts
World Wide Web ( IF 3.7 ) Pub Date : 2020-07-23 , DOI: 10.1007/s11280-020-00814-x
Hao Lin , Yuan Zuo , Guannan Liu , Hong Li , Junjie Wu , Zhiang Wu

In recent years, short text topic modeling has drawn considerable attention from interdisciplinary researchers. Various customized topic models have been proposed to tackle the semantic sparsity of short texts. Most (if not all) of them follow the bag-of-words assumption, which is inadequate because word order and phrases are often critical to capturing the meaning of a text. On the other hand, while some existing topic models are sensitive to word order, they do not perform well on short texts because of severe data sparsity. To address these issues, we propose the Pseudo-document-based Topical N-Grams model (PTNG), which alleviates the data sparsity problem of short texts while remaining sensitive to word order. Extensive experiments on three real-world data sets against state-of-the-art baselines show that PTNG learns high-quality topics, as measured by UCI coherence scores, and yields more discriminative semantic representations of short texts, as measured by classification results.
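For readers unfamiliar with the UCI coherence score mentioned above, the following is a minimal sketch of the usual idea: average the pointwise mutual information (PMI) over all pairs of a topic's top words. It is not the authors' evaluation code. The function name `uci_coherence`, the smoothing constant `eps`, and the toy corpus are assumptions made for illustration, and co-occurrence is counted per document here for simplicity, whereas UCI coherence is commonly computed with sliding windows over an external reference corpus.

```python
import math
from itertools import combinations


def uci_coherence(top_words, documents, eps=1e-12):
    """Average pairwise PMI of a topic's top words (illustrative sketch).

    Assumption: probabilities are estimated from document-level
    co-occurrence in `documents`, a list of token lists.
    """
    n_docs = len(documents)
    doc_sets = [set(doc) for doc in documents]

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        hits = sum(1 for s in doc_sets if all(w in s for w in words))
        return hits / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2 = prob(w1), prob(w2)
        if p1 == 0 or p2 == 0:
            raise ValueError("every top word must occur in the reference corpus")
        # eps keeps the logarithm finite when the pair never co-occurs.
        scores.append(math.log((prob(w1, w2) + eps) / (p1 * p2)))
    return sum(scores) / len(scores)


# Hypothetical toy corpus of tokenized short texts.
corpus = [
    ["topic", "model", "short", "text"],
    ["topic", "model"],
    ["short", "text"],
    ["word", "order"],
]
print(uci_coherence(["topic", "model"], corpus))
```

Words that tend to appear together score above zero, while pairs that never co-occur are pushed strongly negative by the smoothing term, so a higher average indicates a more coherent topic.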




Updated: 2020-07-23