Assessing topic model relevance: Evaluation and informative priors
Statistical Analysis and Data Mining (IF 1.3), Pub Date: 2019-05-02, DOI: 10.1002/sam.11415
Angela Fan, Finale Doshi‐Velez, Luke Miratrix

Latent Dirichlet allocation (LDA) models trained without stopword removal often produce topics with high posterior probabilities on uninformative words, obscuring the underlying corpus content. Even when canonical stopwords are manually removed, uninformative words common in that corpus will still dominate the most probable words in a topic. In this work, we first show how the standard topic quality measures of coherence and pointwise mutual information act counter‐intuitively in the presence of common but irrelevant words, making it difficult to even quantitatively identify situations in which topics may be dominated by stopwords. We propose an additional topic quality metric that targets the stopword problem, and show that it, unlike the standard measures, correctly correlates with human judgments of quality as defined by concentration of information‐rich words. We also propose a simple‐to‐implement strategy for generating topics that are evaluated to be of much higher quality by both human assessment and our new metric. This approach, a collection of informative priors easily introduced into most LDA‐style inference methods, automatically promotes terms with domain relevance and demotes domain‐specific stop words. We demonstrate this approach's effectiveness in three very different domains: Department of Labor accident reports, online health forum posts, and NIPS abstracts. Overall we find that current practices thought to solve this problem do not do so adequately, and that our proposal offers a substantial improvement for those interested in interpreting their topics as objects in their own right.
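
For reference, the two standard measures discussed above are conventionally defined as follows (UMass coherence and corpus-level PMI; this is the usual textbook form, not necessarily the paper's exact notation):

```latex
% UMass coherence of a topic t with top-M words w_1, ..., w_M,
% where D(w) is the document frequency of w and D(w_m, w_l) is the
% co-document frequency of the pair:
C(t) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(w_m, w_l) + 1}{D(w_l)}

% Pointwise mutual information of a word pair, estimated from a
% reference corpus:
\mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}
```

The counter‐intuitive behavior is visible directly in C(t): a word that occurs in nearly every document has D(w_m, w_l) ≈ D(w_l) for any companion word, so each term of the sum sits near its maximum of roughly zero, and a stopword‐dominated topic can score as highly "coherent."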
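A minimal sketch of how an informative prior of the kind described can be introduced into an off‐the‐shelf LDA implementation. The IDF‐style term weighting and all names below are illustrative assumptions, not the authors' construction; the one piece of real API relied on is gensim's support for a per‐term eta vector as the Dirichlet prior on topic–word distributions:

```python
import numpy as np
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus: tokenized documents (in practice: accident reports,
# health forum posts, NIPS abstracts, ...).
docs = [
    ["patient", "reported", "severe", "pain", "the", "the", "of"],
    ["worker", "fell", "from", "ladder", "the", "of", "of"],
    ["model", "training", "converged", "the", "of", "the"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# Illustrative informativeness weight (an assumption standing in for the
# paper's priors): down-weight terms appearing in (almost) every document,
# up-weight rarer, domain-bearing terms.
n_docs = len(docs)
eta = np.empty(len(dictionary))
for term_id in range(len(dictionary)):
    df = dictionary.dfs[term_id]            # document frequency of term
    idf = np.log((1 + n_docs) / (1 + df))   # smoothed IDF score
    eta[term_id] = 0.01 + idf               # small floor keeps the prior proper

# Per-term Dirichlet prior on the topic-word distributions: high-eta terms
# are promoted within topics, near-floor terms (corpus-specific stopwords)
# are demoted.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               eta=eta, alpha="symmetric", passes=10, random_state=0)

for topic_id in range(2):
    print(lda.show_topic(topic_id, topn=5))
```

Any per‐term informativeness score (log‐odds against a background corpus, domain‐lexicon membership, etc.) can be substituted for the IDF weight here without changing the inference code.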

Updated: 2019-05-02