Nested variational autoencoder for topic modelling on microtexts with word vectors
Expert Systems ( IF 3.3 ) Pub Date : 2020-10-08 , DOI: 10.1111/exsy.12639
Trung Trinh, Tho Quan, Trung Mai

Most of the information on the Internet is represented in the form of microtexts, which are short text snippets such as news headlines or tweets. These sources of information are abundant, and mining them can uncover meaningful insights. Topic modelling is a popular method for extracting knowledge from a collection of documents; however, conventional topic models such as latent Dirichlet allocation (LDA) perform poorly on short documents, mostly because of the scarcity of word co-occurrence statistics in the data. The objective of our research is to create a topic model that achieves strong performance on microtexts while keeping runtime small enough to scale to large datasets. To compensate for the scarcity of information in microtexts, our method exploits word embeddings as an additional source of knowledge about relationships between words. For speed and scalability, we apply autoencoding variational Bayes, an algorithm that performs efficient black-box inference in probabilistic models. The result of our work is a novel topic model called the nested variational autoencoder, whose variational distribution takes word vectors into account and is parameterized by a neural network architecture. For optimization, the model is trained to approximate the posterior distribution of the original LDA model. Experiments demonstrate both the improvements of our model on microtexts and its runtime advantage.
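The ingredients named in the abstract (a neural encoder producing topic proportions via the reparameterization trick, and topic-word distributions derived from word vectors) can be illustrated with a minimal forward-pass sketch. This is not the paper's nested architecture; all sizes, weights, and the logistic-normal relaxation of the Dirichlet are illustrative assumptions, with random stand-ins for pretrained word vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): vocabulary, embedding dim, topics.
V, E, K = 50, 8, 4

# Pretrained word vectors (random stand-ins here) and learnable topic embeddings.
word_vecs = rng.normal(size=(V, E))
topic_vecs = rng.normal(size=(K, E))

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Decoder: each topic's word distribution comes from the similarity between
# its embedding and the word vectors, so embedding knowledge shapes the topics.
beta = softmax(topic_vecs @ word_vecs.T)  # (K, V), each row on the simplex

# Encoder: a linear layer maps bag-of-words counts to the mean and log-variance
# of a logistic-normal approximation of the topic posterior (a common relaxation
# of the Dirichlet in autoencoding variational Bayes for topic models).
W_mu = rng.normal(scale=0.1, size=(V, K))
W_lv = rng.normal(scale=0.1, size=(V, K))

def encode(bow):
    mu, logvar = bow @ W_mu, bow @ W_lv
    eps = rng.normal(size=mu.shape)                    # reparameterization trick
    return softmax(mu + eps * np.exp(0.5 * logvar))    # topic proportions theta

bow = rng.poisson(1.0, size=(2, V)).astype(float)  # two toy microtexts as counts
theta = encode(bow)                                # (2, K), rows on the simplex
recon = theta @ beta                               # (2, V) reconstructed word dist.
```

Training would maximize the evidence lower bound (reconstruction log-likelihood of `bow` under `recon` minus a KL term against the prior), which the sketch omits.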

Updated: 2020-10-08