A new anchor word selection method for the separable topic discovery, WIREs Data Mining and Knowledge Discovery


A new anchor word selection method for the separable topic discovery
WIREs Data Mining and Knowledge Discovery (IF 7.8), Pub Date: 2019-05-13, DOI: 10.1002/widm.1313
Kun He₁, Wu Wang₁, Xiaosen Wang₁, John E. Hopcroft₂


Separable nonnegative matrix factorization (SNMF) is an important method for topic modeling, where "separable" means that every topic contains at least one anchor word, defined as a word that has non-zero probability only in that topic. SNMF uses word co-occurrence patterns to reveal topics in two steps: anchor word selection and topic recovery. The quality of the anchor words strongly influences the quality of the extracted topics. Existing anchor word selection algorithms greedily find an approximate convex hull in a high-dimensional word co-occurrence space. In this work, we propose a new method for anchor word selection that relates word co-occurrence probability to word similarity and assumes that the most semantically different words are potential candidates for anchor words. Therefore, if the similarity of a word pair is very low, the two words are very likely to be anchor words. From the statistical information of the text corpora, we obtain the similarity of all word pairs. We build a word similarity graph in which nodes correspond to words and edge weights represent word-pair similarity. We then design a greedy method that finds a minimum edge-weight anchor clique of a given size in this graph for anchor word selection. Extensive experiments on real-world corpora demonstrate the effectiveness of the proposed anchor word selection method, which outperforms the common convex hull-based methods in the quality of the revealed topics. Moreover, our method is much faster than the typical SNMF-based method.
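
The greedy search for a low edge-weight clique described in the abstract can be illustrated with a minimal sketch. The abstract does not give the similarity measure or the exact greedy rule, so everything below is an assumption made for illustration: the function name `greedy_min_weight_clique` is hypothetical, the input is taken to be a precomputed symmetric word-pair similarity matrix, the search is seeded with the least similar pair, and each step adds the word with the smallest total similarity to the words already chosen. This is not the authors' algorithm.

```python
import numpy as np


def greedy_min_weight_clique(similarity, k):
    """Greedily pick k word indices with low total pairwise similarity.

    Illustrative sketch only: the seeding and growth rules are assumptions,
    not the procedure from the paper.
    """
    sim = np.asarray(similarity, dtype=float)
    n = sim.shape[0]
    if sim.shape != (n, n):
        raise ValueError("similarity must be a square matrix")
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= number of words")

    # Seed with the least similar pair; mask the diagonal so that a word
    # is never paired with itself.
    masked = sim.copy()
    np.fill_diagonal(masked, np.inf)
    i, j = np.unravel_index(np.argmin(masked), masked.shape)
    selected = [int(i), int(j)]

    # cost[w] is the total similarity between word w and the chosen words.
    cost = sim[i] + sim[j]
    cost[selected] = np.inf  # chosen words can no longer be picked

    while len(selected) < k:
        w = int(np.argmin(cost))  # word least similar to the current clique
        selected.append(w)
        cost += sim[w]            # account for edges to the new member
        cost[w] = np.inf
    return selected


# Toy example: words 0 and 1 are near-synonyms, words 2 and 3 are distinct.
toy_similarity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.05],
    [0.2, 0.3, 0.05, 1.0],
]
anchors = greedy_min_weight_clique(toy_similarity, k=3)
```

In this toy matrix the sketch keeps only one of the two near-synonyms, which matches the intuition in the abstract that highly dissimilar words are the likely anchor candidates. A full pipeline would still need a similarity estimate derived from corpus co-occurrence statistics and a topic recovery step, neither of which is shown here.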


Updated: 2019-05-13
