A parallel text clustering method using Spark and hashing
Computing (IF 2.044), Pub Date: 2021-04-07, DOI: 10.1007/s00607-021-00932-y
Mohamed Aymen Ben HajKacem, Chiheb-Eddine Ben N’cir, Nadia Essoussi

Clustering textual data has become an important task in data analytics, since many applications need to organize large collections of text documents into homogeneous topics automatically. The rapid growth of textual data available from the web, social networks and open platforms has made this task more challenging. It is therefore important to design scalable clustering methods that can effectively organize huge amounts of textual data into topics. In this context, we propose a new parallel text clustering method based on the Spark framework and hashing. The proposed method addresses both the problem of clustering a huge number of documents and the problem of the high dimensionality of textual data, by integrating a divide-and-conquer approach for the former and implementing a new document hashing strategy for the latter. These two components yield a substantial improvement in scalability and a good approximation of clustering quality. Experiments performed on several large document collections show the effectiveness of the proposed method compared to existing ones in terms of running time and clustering accuracy.
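
The abstract does not describe the hashing strategy or the partitioning scheme in detail, so the following is only a minimal sketch of the two general ideas under stated assumptions, not the authors' method. It uses the generic hashing trick (term counts folded into a fixed number of buckets) in place of the paper's document hashing strategy, and a local-then-global k-means in plain Python in place of a Spark job. All function names and parameters (`hash_document`, `num_buckets`, `num_partitions`) are hypothetical.

```python
import hashlib
from collections import Counter


def hash_document(text, num_buckets=16):
    """Map a document to a fixed-length term-count vector (generic hashing trick)."""
    vector = [0.0] * num_buckets
    for token, count in Counter(text.lower().split()).items():
        # md5 gives a hash that is stable across runs, unlike the built-in hash().
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % num_buckets] += count
    return vector


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means, seeded with the first k points for determinism."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            groups[nearest].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid when a cluster is empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def divide_and_conquer_clustering(documents, k, num_partitions=2, num_buckets=16):
    """Assumed scheme: cluster each partition, then cluster the local centroids."""
    vectors = [hash_document(d, num_buckets) for d in documents]
    # Divide: each slice stands in for one Spark partition.
    partitions = [vectors[i::num_partitions] for i in range(num_partitions)]
    # Conquer: cluster each partition independently.
    local_centroids = []
    for part in partitions:
        local_centroids.extend(kmeans(part, min(k, len(part))))
    # Combine: cluster the local centroids to obtain the global centroids.
    return kmeans(local_centroids, min(k, len(local_centroids)))


docs = [
    "spark runs parallel jobs on a cluster",
    "hashing reduces the dimensionality of text",
    "parallel clustering with spark scales well",
    "document hashing maps words to buckets",
]
centroids = divide_and_conquer_clustering(docs, k=2)
```

In this sketch the vector length depends only on `num_buckets`, not on vocabulary size, which is the property that makes hashing attractive for high-dimensional text. Each partition is clustered without seeing the others, so in a real Spark job that step could run in parallel.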




Updated: 2021-04-08