A New Sentence-Based Interpretative Topic Modeling and Automatic Topic Labeling
Symmetry (IF 2.2) Pub Date: 2021-05-10, DOI: 10.3390/sym13050837
Olzhas Kozbagarov, Rustam Mussabayev, Nenad Mladenovic

This article presents a new conceptual approach to the interpretative topic modeling problem. It uses sentences as the basic units of analysis, instead of the words or n-grams commonly used in standard approaches. The approach is distinguished by its use of sentence probability estimates within the text corpus and by clustering of sentence embeddings. The topic model estimates discrete distributions of sentence occurrences within topics and discrete distributions of topic occurrences within each text. Because sentences, unlike words, are more informative and carry complete grammatical and semantic constructions, the approach makes topics explicitly interpretable. A method for automatic topic labeling is also provided. Contextual embeddings based on the BERT model are used to obtain the corresponding sentence embeddings for subsequent analysis. Moreover, the approach supports big data processing and shows how internal and external knowledge sources can be combined in the topic modeling process. The internal knowledge source is the text corpus itself, which is often the sole knowledge source in traditional topic modeling approaches. The external knowledge source is BERT, a machine learning model pre-trained on a large amount of textual data and used here to generate context-dependent sentence embeddings.
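The pipeline described above (embed sentences, cluster the embeddings into topics, estimate per-document topic distributions, and label each topic with a representative sentence) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: random vectors stand in for BERT sentence embeddings, and a minimal k-means stands in for whatever clustering method the paper uses.

```python
# Illustrative sketch of sentence-based topic modeling with automatic labeling.
# Random vectors stand in for contextual BERT sentence embeddings; a toy
# k-means stands in for the paper's clustering step. All names are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of sentences.
docs = [
    ["cats chase mice", "dogs chase cats"],
    ["stocks fell sharply", "markets rallied today", "dogs bark loudly"],
]
sentences = [s for d in docs for s in d]

# Stand-in for BERT-based sentence embeddings, L2-normalized.
emb = rng.normal(size=(len(sentences), 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: returns cluster labels and centroids."""
    r = np.random.default_rng(seed)
    centroids = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(0)
    return labels, centroids

k = 2  # number of topics
labels, centroids = kmeans(emb, k)

# Discrete distribution of topic occurrences within each document.
start = 0
for i, d in enumerate(docs):
    counts = np.bincount(labels[start:start + len(d)], minlength=k)
    print(f"doc {i} topic distribution:", counts / len(d))
    start += len(d)

# Automatic topic labeling: the sentence closest to each cluster centroid
# serves as an explicitly interpretable label for that topic.
for j in range(k):
    dists = ((emb - centroids[j]) ** 2).sum(-1)
    print(f"topic {j} label:", sentences[int(np.argmin(dists))])
```

In practice the random embeddings would be replaced by real contextual sentence embeddings (e.g. from a BERT-based encoder), so that semantically similar sentences cluster into the same topic and the centroid-nearest sentence gives a readable label.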

Updated: 2021-05-10