Agreeing to Disagree: Choosing Among Eight Topic-Modeling Methods, Big Data Research



Big Data Research (IF 3.5), Pub Date: 2020-12-16, DOI: 10.1016/j.bdr.2020.100173
Qiang Fu, Yufan Zhuang, Jiaxin Gu, Yushu Zhu, Xin Guo

Topic modeling is a key research area in natural language processing and has inspired innovative studies in a wide array of social-science disciplines. Yet, the use of topic modeling in computational social science has been hampered by two critical issues. First, social scientists tend to focus on a few standard ways of topic modeling. Our understanding of semantic patterns has not been informed by rapid methodological advances in topic modeling. Moreover, a systematic comparison of the performance of different methods in this field is warranted. Second, the choice of the optimal number of topics remains a challenging task. A comparison of topic-modeling techniques has rarely been situated in a social-science context and the choice appears to be arbitrary for most social scientists. Based on about 120,000 Canadian newspaper articles since 1977, we review and compare eight traditional, generative, and neural methods for topic modeling (Latent Semantic Analysis, Principal Component Analysis, Factor Analysis, Non-negative Matrix Factorization, Latent Dirichlet Allocation, Neural Autoregressive Topic Model, Neural Variational Document Model, and Hierarchical Dirichlet Process). Three measures (coherence statistics, held-out likelihood, and graph-based dimensionality selection) are then used to assess the performance of these methods. Findings are presented and discussed to guide the choice of topic-modeling methods, especially in social science research.
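
As a rough illustration of the first of the three measures, the sketch below scores a topic's top words by how often they co-occur in the same documents. The abstract does not say which coherence statistic the authors use; the UMass formulation (Mimno et al., 2011) is assumed here, and the function name `umass_coherence` and the toy corpus are hypothetical, not taken from the paper.

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic (assumed variant, not the paper's code).

    top_words: the topic's top words, most probable first.
    documents: list of tokenized documents (lists of strings).
    Higher (less negative) scores mean the words co-occur more often.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # word never appears in the corpus; skip the pair
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / denom)  # +1 avoids log(0)
    return score


# Toy corpus standing in for tokenized newspaper articles (made up).
docs = [
    ["election", "vote", "party"],
    ["election", "vote", "leader"],
    ["hockey", "goal", "team"],
    ["hockey", "team", "season"],
]

coherent = umass_coherence(["election", "vote"], docs)
mixed = umass_coherence(["election", "hockey"], docs)
```

Under this measure, a topic whose top words appear together ("election", "vote") scores higher than one that mixes unrelated words ("election", "hockey"). Averaging such scores over all topics of a fitted model, and repeating for several topic counts, is one common way to compare methods or candidate numbers of topics.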


Updated: 2020-12-22
