Topic modelling discourse dynamics in historical newspapers,arXiv - CS - Computation and Language



arXiv - CS - Computation and Language. Pub Date: 2020-11-20, DOI: arxiv-2011.10428
Jani Marjanen, Elaine Zosa, Simon Hengchen, Lidia Pivovarova, Mikko Tolonen

This paper addresses methodological issues in diachronic data analysis for historical research. We apply two families of topic models (LDA and DTM) to a relatively large set of historical newspapers, with the aim of capturing and understanding discourse dynamics. Our case study focuses on newspapers and periodicals published in Finland between 1854 and 1917, but our method can easily be transferred to any diachronic data. Our main contributions are a) a combined sampling, training and inference procedure for applying topic models to huge and imbalanced diachronic text collections; b) a discussion of the differences between the two topic models for this type of data; c) a way of quantifying topic prominence for a period, which generalizes document-wise topic assignment to the discourse level; and d) a discussion of the role of humanistic interpretation in analysing discourse dynamics through topic models.
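To make contributions a) and c) more concrete, here is a minimal Python sketch. It is not the authors' procedure: the abstract does not say how documents are sampled or how prominence is computed, so both functions rest on assumptions. `sample_per_period` (a hypothetical name) assumes that balancing means drawing at most a fixed number of documents from each period, and `topic_prominence` (also hypothetical) assumes that prominence is the mean of the document-topic distributions of the documents published in a period.

```python
import random
from collections import defaultdict


def sample_per_period(docs, per_period, seed=0):
    """Draw at most `per_period` document ids from each period.

    `docs` is a list of (period, doc_id) pairs. Capping every period at
    the same size is an assumed way of handling an imbalanced collection.
    """
    rng = random.Random(seed)
    by_period = defaultdict(list)
    for period, doc_id in docs:
        by_period[period].append(doc_id)
    sample = {}
    for period, ids in sorted(by_period.items()):
        k = min(per_period, len(ids))
        sample[period] = sorted(rng.sample(ids, k))
    return sample


def topic_prominence(doc_topics, doc_periods):
    """Average the document-topic distributions within each period.

    `doc_topics` maps doc_id -> list of topic probabilities (for example,
    inferred by a trained topic model); `doc_periods` maps doc_id -> period.
    Using the mean as the prominence score is an assumption.
    """
    sums = {}
    counts = defaultdict(int)
    for doc_id, dist in doc_topics.items():
        period = doc_periods[doc_id]
        if period not in sums:
            sums[period] = [0.0] * len(dist)
        for k, p in enumerate(dist):
            sums[period][k] += p
        counts[period] += 1
    return {
        period: [s / counts[period] for s in total]
        for period, total in sums.items()
    }
```

Under these assumptions, a model would be trained on the balanced sample, topic distributions would then be inferred for the full collection, and `topic_prominence` would turn those per-document assignments into one score per topic and period.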


Updated: 2020-11-23
