A neural topic model with word vectors and entity vectors for short texts
Information Processing & Management (IF 8.6), Pub Date: 2020-12-11, DOI: 10.1016/j.ipm.2020.102455
Xiaowei Zhao, Deqing Wang, Zhengyang Zhao, Wei Liu, Chenwei Lu, Fuzhen Zhuang

Traditional topic models are widely used for semantic discovery from long texts. However, they usually fail to mine high-quality topics from short texts (e.g., tweets) due to the sparsity of features and the lack of word co-occurrence patterns. In this paper, we propose a Variational Auto-Encoder Topic Model (VAETM for short) that combines word vector representations and entity vector representations to address the above limitations. Specifically, we first learn embedding representations of each word and each entity by employing a large-scale external corpus and a large, manually curated knowledge graph, respectively. Then we integrate the embedding representations into the variational auto-encoder framework and propose an unsupervised model named VAETM to infer the latent representation of topic distributions. To further boost VAETM, we propose an improved supervised VAETM (SVAETM for short) that uses label information in the training set to supervise the inference of the latent representation of topic distributions and the generation of topics. Finally, we propose KL-divergence-based inference algorithms to infer the approximate posterior distribution for our two models. Extensive experiments on three common short text datasets demonstrate that our proposed VAETM and SVAETM outperform various kinds of state-of-the-art models in terms of perplexity, NPMI, and accuracy.
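
To make the architecture described above concrete, here is a minimal sketch (in PyTorch) of a VAE-based neural topic model that conditions the encoder on both word vectors and entity vectors and adds an optional label term for the supervised variant. This is not the authors' released implementation; all class, parameter, and dimension names (VAETM, n_topics, emb_dim, the 256-unit encoder, etc.) are illustrative assumptions.

```python
# Minimal sketch of a VAE topic model with word + entity embeddings (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAETM(nn.Module):
    def __init__(self, vocab_size, n_entities, emb_dim, n_topics, n_labels=None):
        super().__init__()
        # Pre-trained word / entity embeddings would normally be loaded here
        # (e.g. from a large external corpus and a knowledge graph);
        # random initialisation keeps the sketch self-contained.
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.entity_emb = nn.Embedding(n_entities, emb_dim)
        # Encoder input: bag-of-words counts concatenated with averaged embeddings.
        enc_in = vocab_size + 2 * emb_dim
        self.encoder = nn.Sequential(nn.Linear(enc_in, 256), nn.Softplus())
        self.fc_mu = nn.Linear(256, n_topics)
        self.fc_logvar = nn.Linear(256, n_topics)
        # Decoder: topic proportions -> word distribution.
        self.decoder = nn.Linear(n_topics, vocab_size)
        # Optional classifier head for the supervised variant (SVAETM).
        self.classifier = nn.Linear(n_topics, n_labels) if n_labels else None

    def forward(self, bow, word_ids, entity_ids):
        w = self.word_emb(word_ids).mean(dim=1)      # average word vectors
        e = self.entity_emb(entity_ids).mean(dim=1)  # average entity vectors
        h = self.encoder(torch.cat([bow, w, e], dim=-1))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterisation trick.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        theta = F.softmax(z, dim=-1)                 # topic proportions
        recon = F.log_softmax(self.decoder(theta), dim=-1)
        logits = self.classifier(theta) if self.classifier is not None else None
        return recon, mu, logvar, logits

def loss_fn(recon, bow, mu, logvar, logits=None, labels=None):
    # Reconstruction term: negative log-likelihood of the bag of words.
    nll = -(bow * recon).sum(dim=-1).mean()
    # KL divergence between q(z|x) and a standard normal prior.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
    loss = nll + kl
    if logits is not None and labels is not None:    # supervised (SVAETM) term
        loss = loss + F.cross_entropy(logits, labels)
    return loss
```

The supervised variant simply adds a cross-entropy term on the topic proportions, so the same latent representation is shaped by both reconstruction and label information.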



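The abstract evaluates topics with NPMI coherence, among other metrics. The sketch below shows one common way to compute NPMI from document-level co-occurrence counts; the helper name and the reference-corpus choice are assumptions for illustration, not necessarily the paper's exact setup.

```python
# Sketch of NPMI-based topic coherence (document-level co-occurrence, assumed setup).
import math
from itertools import combinations

def npmi(topic_words, docs, eps=1e-12):
    """Average normalised PMI over word pairs of one topic.

    topic_words: list of top words for the topic.
    docs: list of documents, each given as a set of tokens.
    """
    n = len(docs)
    def p(*words):
        return sum(all(w in d for w in words) for d in docs) / n
    scores = []
    for wi, wj in combinations(topic_words, 2):
        p_i, p_j, p_ij = p(wi), p(wj), p(wi, wj)
        if p_ij <= 0:
            scores.append(-1.0)  # convention for pairs that never co-occur
            continue
        pmi = math.log(p_ij / (p_i * p_j + eps))
        scores.append(pmi / (-math.log(p_ij + eps)))
    return sum(scores) / len(scores)

# Usage example (hypothetical): npmi(["vaccine", "virus", "health"],
#                                    [set(doc.split()) for doc in reference_corpus])
```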

Update date: 2020-12-13