Topic representation model based on microblogging behavior analysis, World Wide Web

World Wide Web (IF 2.7), Pub Date: 2020-06-15, DOI: 10.1007/s11280-020-00822-x
Weihong Han, Zhihong Tian, Zizhong Huang, Shudong Li, Yan Jia

Microblogging has become an important way for people to obtain information, express opinions, and make suggestions. Identifying new topics quickly and accurately from massive microblogging data plays a crucial role in recommending information and controlling public opinion. The topic representation model provides a basis for topic detection. In this paper, we propose a topic representation model for microblogging datasets based on user behavior analysis: the microblogging behavior analysis-latent Dirichlet allocation (MBA-LDA) model. The topic-word distribution is obtained from an LDA model that takes into account user behaviors (such as posting, forwarding, and commenting) and the distribution of words among documents within one topic and among different topics. The model also re-assesses the importance of words in topic representation. The basic idea is that how a word is distributed within a topic and among topics strongly influences whether it should be selected to represent a topic. If a word is evenly distributed among all documents of a topic, it is common to all documents in that topic and is therefore more suitable to represent it. If a word is evenly distributed among the various topics, it is common to all topics and cannot distinguish between them, so it is less suitable to represent any topic. In experiments on a real Sina Microblogging dataset, the MBA-LDA topic model gives representative words greater weight and increases the differentiation of topic words, which effectively improves the accuracy of subsequent topic detection and evolutionary analysis.
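The abstract does not give the formulas MBA-LDA uses to re-assess word importance, so the following is only a minimal sketch of the stated idea, not the authors' method. It assumes that "evenness" is measured by normalized Shannon entropy and that a base topic-word weight (for example, from plain LDA) is scaled up by within-topic evenness and down by cross-topic evenness. The function names `normalized_entropy` and `reweight`, the entropy measure, and the multiplicative combination are all hypothetical choices made for illustration.

```python
import math


def normalized_entropy(counts):
    """Evenness of a count distribution in [0, 1]; 1 means perfectly even.

    Assumption: Shannon entropy divided by its maximum, log(len(counts)).
    """
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    entropy = sum(p * math.log(1.0 / p) for p in probs)
    # Clamp to guard against floating-point overshoot slightly above 1.
    return min(1.0, entropy / math.log(len(counts)))


def reweight(base_weight, counts_in_topic_docs, counts_per_topic):
    """Hypothetical re-assessed weight of one word for one topic.

    counts_in_topic_docs: occurrences of the word in each document of the topic.
    counts_per_topic: occurrences of the word in each topic of the corpus.
    """
    within = normalized_entropy(counts_in_topic_docs)  # even within topic: good
    across = normalized_entropy(counts_per_topic)      # even across topics: bad
    return base_weight * within * (1.0 - across)


# A word spread evenly over one topic's documents and absent elsewhere
topical = reweight(0.5, [3, 3, 3, 3], [12, 0, 0])
# A word spread evenly over every topic (a common word)
common = reweight(0.5, [2, 2, 2, 2], [8, 8, 8])
print(topical, common)
```

Under these assumptions the topical word keeps its full base weight, while the word common to all topics is pushed toward zero, which matches the qualitative behavior the abstract describes.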


Updated: 2020-06-15
