Topic extraction from extremely short texts with variational manifold regularization
Machine Learning (IF 4.3), Pub Date: 2021-04-16, DOI: 10.1007/s10994-021-05962-3
Ximing Li, Yang Wang, Jihong Ouyang, Meng Wang

With the emergence of massive numbers of short texts, e.g., social media posts and question titles from Q&A systems, discovering valuable information in them is increasingly important for many real-world content analysis applications. The family of topic models can effectively explore the hidden structure of documents through the assumption of latent topics. However, because short texts are sparse, existing topic models, e.g., latent Dirichlet allocation, lose effectiveness on them. An effective solution, the Dirichlet multinomial mixture (DMM), supposes that each short text is associated with only a single topic, which indirectly enriches document-level word co-occurrences. However, DMM is sensitive to noisy words, and it often learns inaccurate topic representations at the document level. To address this problem, we extend DMM to a novel Laplacian Dirichlet Multinomial Mixture (LapDMM) topic model for short texts. The basic idea of LapDMM is to preserve the local neighborhood structure of short texts, enabling topical signals to spread among neighboring documents and thereby correct inaccurate topic representations. This is achieved by incorporating variational manifold regularization into the variational objective of DMM, constraining nearby short texts to have similar variational topic representations. To find the nearest neighbors of short texts, we construct an offline document graph before model inference, in which distances between short texts are computed with the word mover's distance. We further develop an online version of LapDMM, namely Online LapDMM, to speed up inference on massive short text collections. To this end, we exploit stochastic optimization with mini-batches and an up-to-date document graph that efficiently finds approximate nearest neighbors instead. To evaluate our models, we compare against state-of-the-art short text topic models on several traditional tasks, i.e., topic quality, document clustering, and classification. The empirical results demonstrate that our models achieve significant performance gains over the baseline models.
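To make the regularized objective concrete, here is a minimal sketch of the kind of objective the abstract describes; the notation is illustrative rather than taken from the paper (gamma_d is the variational topic distribution of document d, W_{dd'} is the edge weight between documents d and d' in the document graph, and lambda is the regularization strength):

    \mathcal{L} = \sum_{d} \mathrm{ELBO}(d) - \lambda \sum_{d,d'} W_{dd'} \, \lVert \gamma_{d} - \gamma_{d'} \rVert_2^2

so documents connected in the graph are pushed toward similar variational topic representations. The offline document graph can likewise be sketched in a few lines of Python; this is an assumption-laden illustration built on gensim's wmdistance, not the authors' implementation, and build_knn_graph with its parameters is a hypothetical name:

    # Minimal sketch (not the authors' code): a k-nearest-neighbor document
    # graph under the word mover's distance, assuming pre-trained word vectors.
    import numpy as np
    from gensim.models import KeyedVectors

    def build_knn_graph(docs, vectors, k=10):
        """docs: list of tokenized short texts -> symmetric k-NN adjacency."""
        n = len(docs)
        dist = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                # WMD between two token lists (needs gensim's WMD
                # dependency, POT or pyemd depending on the version)
                dist[i, j] = dist[j, i] = vectors.wmdistance(docs[i], docs[j])
        adj = np.zeros((n, n))
        for i in range(n):
            neighbors = np.argsort(dist[i])[1:k + 1]  # k closest other docs
            adj[i, neighbors] = 1.0
        return np.maximum(adj, adj.T)  # symmetrize the neighborhood relation

For the online variant described in the abstract, the same graph is maintained approximately over mini-batches so that the regularized objective can be optimized stochastically on massive collections.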




Updated: 2021-04-16