Improving biterm topic model with word embeddings
World Wide Web (IF 3.7), Pub Date: 2020-09-08, DOI: 10.1007/s11280-020-00823-w
Jiajia Huang, Min Peng, Pengwei Li, Zhiwei Hu, Chao Xu

As one of the fundamental information extraction methods, the topic model has been widely used in text clustering, information recommendation, and other text analysis tasks. Conventional topic models mainly utilize word co-occurrence information in texts for topic inference. However, when these models are applied to short texts, it is usually hard to extract a group of words that is semantically coherent and has adequate representation ability, because the feature space of short texts is too sparse to provide enough co-occurrence information for topic inference. The continuous development of word embeddings brings new representations of words and a more effective measurement of word semantic similarity from a conceptual perspective. In this study, we first mine word co-occurrence patterns (i.e., biterms) from a short-text corpus and then calculate each biterm's frequency and the semantic similarity between its two words. The results show that a biterm with higher frequency or semantic similarity usually has more similar words in the corpus. Based on these results, we develop a novel probabilistic topic model, named Noise Biterm Topic Model with Word Embeddings (NBTMWE). NBTMWE extends the Biterm Topic Model (BTM) by introducing a noise topic whose prior is derived from biterm frequency and semantic similarity. Compared with BTM, NBTMWE has the following advantages: (1) it can distinguish meaningful latent topics from a noise topic consisting of common words that appear in many texts of the dataset; (2) it can promote a biterm's semantically related words into the same topic during sampling via the generalized Pólya Urn (GPU) model. Using auxiliary word embeddings trained on a large-scale corpus, we report test results on two short-text datasets (Sina Weibo and Web Snippets). Quantitatively, NBTMWE outperforms state-of-the-art models in terms of topic coherence, topic word similarity, and classification accuracy. Qualitatively, each topic generated by NBTMWE contains more semantically similar words and shows superior intelligibility.
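The abstract's preprocessing step (mining biterms, counting their frequency, and scoring the embedding similarity of each biterm's two words) can be illustrated with a short sketch. The following Python is not the authors' implementation: the toy `embeddings` dictionary, the `docs` list, and all function names are illustrative assumptions standing in for embeddings trained on a large external corpus.

```python
# Minimal sketch of the biterm mining and scoring step described in the
# abstract (illustrative only; not the paper's code).
from collections import Counter
from itertools import combinations

import numpy as np

# Hypothetical pre-trained word vectors. In the paper these are auxiliary
# embeddings (e.g., word2vec-style) trained on a large external corpus.
rng = np.random.default_rng(0)
vocab = ["topic", "model", "weibo", "snippet", "noise", "word"]
embeddings = {w: rng.standard_normal(50) for w in vocab}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def mine_biterms(docs):
    """Extract unordered word pairs (biterms) from each short text."""
    counts = Counter()
    for doc in docs:
        tokens = [t for t in doc.split() if t in embeddings]
        for w1, w2 in combinations(sorted(set(tokens)), 2):
            counts[(w1, w2)] += 1
    return counts

docs = ["topic model word", "noise word weibo", "topic model snippet"]
biterm_freq = mine_biterms(docs)
for (w1, w2), freq in biterm_freq.items():
    sim = cosine(embeddings[w1], embeddings[w2])
    # Frequency and similarity jointly serve as prior knowledge: a biterm
    # with low frequency and low similarity is more plausibly assigned to
    # the noise topic rather than to a meaningful latent topic.
    print(f"({w1}, {w2}): freq={freq}, sim={sim:.3f}")
```

Per the abstract, these frequency and similarity scores act as the prior that lets NBTMWE decide whether a biterm was generated by the noise topic, while the GPU sampling step additionally boosts words that are semantically close to a sampled biterm into the same topic.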




Updated: 2020-09-08