A novel text clustering model based on topic modelling and social network analysis
Chaos, Solitons & Fractals (IF 5.3), Pub Date: 2024-03-02, DOI: 10.1016/j.chaos.2024.114633
Babak Amiri, Ramin Karimianghadim

Document clustering is a well-known text-mining method that supports the categorization and comprehension of textual data. It is vital in areas such as information retrieval, knowledge management, and marketing, which underscores the need for highly accurate clustering models. Current document clustering models face significant hurdles: the sparse, high-dimensional representations produced by the bag-of-words (BOW) approach are computationally demanding on large datasets and fail to capture the semantic nuances of documents. These models also struggle to determine the ideal number of clusters and to handle datasets with overlapping elements. To overcome these issues, this paper introduces a novel co-clustering strategy that merges community detection methods from social network analysis with advanced text analysis techniques. The proposed method transforms the documents into a network in which each document is a node and edges connect the documents that are most similar to one another. Community detection algorithms are then applied to identify clusters within this document network. The study explores several document representation methods, including topic modelling and sentence embedding, to provide a rich contextual understanding of the documents. An extensive evaluation examines different combinations of community detection algorithms, clustering methodologies, and document representation strategies, focusing on how well they handle overlapping and non-overlapping datasets. The findings demonstrate that the Element-Centric evaluation measure enables community detection algorithms to determine the most suitable number of clusters autonomously, yielding promising results for both overlapping and non-overlapping datasets. The LCD model performs remarkably well on overlapping datasets. Furthermore, the research reveals that innovative document representation approaches significantly enhance the performance of the models, and that topic modelling used in conjunction with co-clustering algorithms is effective in clearly depicting the themes within the clusters.
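The pipeline described in the abstract (documents become nodes, the most similar documents are linked, and communities in the resulting network are read off as clusters) can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy corpus, the function names, the choice of k = 2 nearest neighbours, the use of plain bag-of-words cosine similarity in place of topic-model or sentence-embedding representations, and a deterministic label propagation routine standing in for the community detection algorithms the authors evaluate. It produces non-overlapping clusters only and does not reproduce the Element-Centric measure or the LCD model.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: three finance sentences and three football sentences.
DOCS = [
    "stock market prices fall",
    "market investors sell stock",
    "stock prices and market investors",
    "team wins football match",
    "football team scores goal",
    "goal wins the match",
]


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_knn_graph(vectors, k=2):
    """Link each document to its k most similar documents (undirected edges)."""
    adj = {i: set() for i in range(len(vectors))}
    for i, vec in enumerate(vectors):
        sims = [(cosine(vec, other), j) for j, other in enumerate(vectors) if j != i]
        sims.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by index
        for sim, j in sims[:k]:
            if sim > 0:  # never link documents that share no terms
                adj[i].add(j)
                adj[j].add(i)
    return adj


def label_propagation(adj, max_iter=100):
    """Deterministic label propagation: each node adopts its neighbours' majority label."""
    labels = {node: node for node in adj}
    for _ in range(max_iter):
        changed = False
        for node in sorted(adj):
            if not adj[node]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[nb] for nb in adj[node])
            top = max(counts.values())
            best = sorted(label for label, c in counts.items() if c == top)
            if labels[node] in best:
                continue  # keep the current label when it is already a majority label
            labels[node] = best[0]  # break ties by the smallest label
            changed = True
        if not changed:
            break
    return labels


def cluster_documents(docs, k=2):
    """Return clusters as sorted lists of document indices."""
    vectors = [Counter(doc.lower().split()) for doc in docs]
    labels = label_propagation(build_knn_graph(vectors, k))
    clusters = defaultdict(list)
    for node in sorted(labels):
        clusters[labels[node]].append(node)
    return sorted(clusters.values())


if __name__ == "__main__":
    for cluster in cluster_documents(DOCS, k=2):
        print([DOCS[i] for i in cluster])
```

On this toy corpus the sketch separates the three finance sentences from the three football sentences without being told how many clusters to look for, which is the property the abstract attributes to community detection. Replacing the term-count vectors with topic distributions or sentence embeddings would only change how `vectors` and the similarity function are computed; the graph construction and community detection steps would stay the same.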

Updated: 2024-03-02