EGC: A novel event-oriented graph clustering framework for social media text
Information Processing & Management (IF 7.4), Pub Date: 2022-09-14, DOI: 10.1016/j.ipm.2022.103059
Die Hu, Dan Feng, Yulai Xie

With the popularity of social platforms such as Sina Weibo and Twitter, a large number of public events spread rapidly on social networks, and a huge amount of textual data is generated as netizens discuss them. Social text clustering has become one of the most critical methods for helping people find relevant information, and it provides quality data for subsequent timely public opinion analysis. Most existing neural clustering methods rely on manually labeled training sets and take a long time to train. Because social media data is explosive and large in scale, satisfying users' timeliness demands is a challenge for social text clustering. This paper proposes a novel unsupervised event-oriented graph clustering framework (EGC), which clusters large-scale datasets effectively with less time overhead and requires no labeled data. Specifically, EGC first mines the potential relations in social text data and transforms the textual data of social media into an event-oriented graph, taking advantage of the graph structure's ability to represent complex relations. Second, EGC uses a keyword-based local importance method to accurately measure the weights of relations in the event-oriented graph. Finally, a bidirectional depth-first clustering algorithm based on the interrelations is proposed to cluster the nodes in the event-oriented graph. By projecting the relations of the graph into a smaller domain, EGC achieves fast convergence. The experimental results show that EGC reaches 0.926 (NMI), 0.926 (AMI), and 0.866 (ARI) on the Weibo dataset, which is 13%–30% higher than other clustering methods. In addition, the average query time on EGC-clustered data is 16.7 ms, which is 90% less than on the original data.
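
The three stages named in the abstract (graph construction, relation weighting, depth-first clustering) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the keyword-based local importance measure or the bidirectional traversal, so the sketch assumes Jaccard keyword overlap as a stand-in edge weight and a plain depth-first traversal over edges above a threshold. The names `build_graph`, `cluster`, `threshold`, and the sample `posts` are all hypothetical.

```python
from collections import defaultdict


def build_graph(docs):
    """Link posts that share a keyword; weight = Jaccard overlap (an assumption)."""
    keyword_index = defaultdict(set)  # keyword -> ids of posts containing it
    for doc_id, keywords in docs.items():
        for kw in keywords:
            keyword_index[kw].add(doc_id)

    graph = {doc_id: {} for doc_id in docs}
    for ids in keyword_index.values():
        for a in ids:
            for b in ids:
                if a < b and b not in graph[a]:
                    shared = docs[a] & docs[b]
                    weight = len(shared) / len(docs[a] | docs[b])
                    graph[a][b] = weight
                    graph[b][a] = weight
    return graph


def cluster(graph, threshold=0.3):
    """Group nodes by iterative depth-first traversal over edges >= threshold."""
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            node = stack.pop()
            members.append(node)
            for neighbor, weight in graph[node].items():
                if weight >= threshold and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(members))
    return clusters


# Hypothetical posts, already reduced to keyword sets.
posts = {
    0: {"earthquake", "rescue", "sichuan"},
    1: {"earthquake", "sichuan", "aftershock"},
    2: {"concert", "tickets", "beijing"},
    3: {"concert", "beijing", "singer"},
    4: {"rescue", "earthquake", "donation"},
}
print(cluster(build_graph(posts)))  # [[0, 1, 4], [2, 3]]
```

Posts 1 and 4 share only one keyword (weight 0.2, below the threshold), yet they land in the same cluster because both are strongly linked to post 0. This shows how traversing relations in the graph can group texts about one event even when their direct overlap is weak.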




Updated: 2022-09-14