A parallel text clustering method using Spark and hashing
Computing (IF 3.3) Pub Date: 2021-04-07, DOI: 10.1007/s00607-021-00932-y
Mohamed Aymen Ben HajKacem, Chiheb-Eddine Ben N'cir, Nadia Essoussi

Clustering textual data has become an important task in data analytics, since many applications require automatically organizing large amounts of textual documents into homogeneous topics. The rapid growth of available textual data from the web, social networks, and open platforms has made this task challenging: it is now essential to design scalable clustering methods able to effectively organize huge amounts of textual data into topics. In this context, we propose a new parallel text clustering method based on the Spark framework and hashing. The proposed method deals simultaneously with the problem of clustering huge numbers of documents and the problem of the high dimensionality of textual data, by integrating a divide-and-conquer approach and implementing a new document hashing strategy, respectively. Together, these yield a substantial improvement in scalability while closely approximating the clustering quality of non-hashed methods. Experiments performed on several large collections of documents show the effectiveness of the proposed method compared to existing ones in terms of running time and clustering accuracy.
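The dimensionality-reduction idea the abstract mentions can be illustrated with the generic feature-hashing trick: each document is mapped to a fixed-size vector by hashing its tokens into buckets, so the vector length no longer grows with the vocabulary. This is a minimal sketch of that general technique, not the authors' specific document hashing strategy; the function name and dimension are illustrative.

```python
import hashlib

def hash_vector(tokens, dim=16):
    """Map a token list to a fixed-size count vector via the hashing trick.

    Note: this is a generic sketch of feature hashing for text, not the
    paper's proposed document hashing strategy. `dim` caps dimensionality
    regardless of vocabulary size.
    """
    vec = [0] * dim
    for tok in tokens:
        # Stable hash (md5) so bucket assignment is reproducible across runs.
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        vec[h % dim] += 1
    return vec

docs = [
    "spark makes large scale clustering feasible".split(),
    "hashing reduces the dimensionality of text features".split(),
]
vectors = [hash_vector(d) for d in docs]
```

Because every document lands in the same fixed-size space, distance computations during clustering cost O(dim) instead of O(vocabulary), which is what makes the approach attractive at scale.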


Updated: 2021-04-08