Monolingual and multilingual topic analysis using LDA and BERT embeddings
Journal of Informetrics (IF 3.4), Pub Date: 2020-06-25, DOI: 10.1016/j.joi.2020.101055
Qing Xie , Xinyuan Zhang , Ying Ding , Min Song

Analyzing research topics offers potential insights into the direction of scientific development. In particular, analyzing multilingual research topics can help researchers grasp the evolution of topics globally, revealing topic similarity among scientific publications written in different languages. Most studies to date on topic analysis have been based on English-language publications and have relied heavily on citation-based topic evolution analysis. However, since it can be challenging for English publications to cite non-English sources, and since many languages do not offer English translations of abstracts, citation-based methodologies are not suitable for analyzing multilingual research topic relations. Since multilingual sentence embeddings can effectively preserve word semantics in multilingual translation tasks, a topic model based on multilingual sentence embeddings could potentially generate topic–word distributions for publications in multilingual analysis. In this paper, which is situated in the field of library and information science, we use multilingual pretrained Bidirectional Encoder Representations from Transformers (BERT) embeddings and the Latent Dirichlet Allocation (LDA) topic model to analyze topic evolution in monolingual and multilingual topic similarity settings. For each topic, we multiply its LDA probability value by the averaged tensor similarity of BERT embeddings to explore the evolution of the topic in scientific publications. As our proposed method does not rely on machine translation or the author's subjective translation, it avoids the confusion and misuse caused by either machine error or the author's subjectively chosen English keywords. Our results show that the proposed approach is well-suited to analyzing topic evolution in both monolingual settings and multilingual topic similarity relations.
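The abstract does not give the exact formula, but the core scoring step it describes — multiplying a topic's LDA probability by the averaged similarity of multilingual BERT embeddings — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding vectors below are made-up toy vectors standing in for multilingual BERT sentence embeddings, and `cross_lingual_topic_score` is a hypothetical name for the combined score.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cross_lingual_topic_score(lda_topic_prob, embeddings_lang_a, embeddings_lang_b):
    """Combine an LDA topic probability with embedding similarity:
    the topic's probability is scaled by the averaged pairwise
    cosine similarity between sentence embeddings from two corpora
    in different languages (toy stand-in for multilingual BERT)."""
    sims = [cosine_similarity(a, b)
            for a in embeddings_lang_a
            for b in embeddings_lang_b]
    return lda_topic_prob * float(np.mean(sims))

# Toy 4-d "embeddings" for documents on one topic in two languages.
emb_en = [np.array([1.0, 0.0, 0.0, 0.0]),
          np.array([0.9, 0.1, 0.0, 0.0])]
emb_zh = [np.array([1.0, 0.0, 0.0, 0.0])]

# An LDA topic probability of 0.5, scaled by near-identical embeddings,
# yields a score close to 0.5.
score = cross_lingual_topic_score(0.5, emb_en, emb_zh)
```

In a real pipeline, the embeddings would come from a multilingual pretrained BERT model, and the LDA probabilities from a topic model fitted per corpus; the product then lets a topic's strength and its cross-lingual coherence be read off a single value.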




Updated: 2020-06-25