WEClustering: word embeddings based text clustering technique for large datasets
Complex & Intelligent Systems (IF 5.0), Pub Date: 2021-09-07, DOI: 10.1007/s40747-021-00512-9
Vivek Mehta, Seema Bawa, Jasmeet Singh

A massive amount of textual data now exists in digital repositories in the form of research articles, news articles, reviews, Wikipedia articles, books, and so on. Text clustering is a fundamental data mining technique used for categorization, topic extraction, and information retrieval. Textual datasets, especially those that contain a large number of documents, are sparse and high-dimensional. Hence, traditional clustering techniques such as K-means, agglomerative clustering, and DBSCAN do not perform well on them. In this paper, a clustering technique especially suited to large text datasets is proposed that overcomes these limitations. The proposed technique, named WEClustering, is based on word embeddings derived from a recent deep learning model, “Bidirectional Encoder Representations from Transformers” (BERT). It deals with the problem of high dimensionality effectively, and hence forms more accurate clusters. The technique is validated on several datasets of varying sizes, and its performance is compared with other widely used and state-of-the-art clustering techniques. The experimental comparison shows that the proposed technique gives a significant improvement over other techniques as measured by metrics such as Purity and Adjusted Rand Index.
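
The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. The code below is not taken from the paper: the function names `purity` and `adjusted_rand_index` and the toy labels are assumptions made for illustration, and the formulas are the standard textbook definitions (purity as the share of documents in the majority class of their cluster, Adjusted Rand Index as the chance-corrected pair-counting agreement).

```python
from collections import Counter
from math import comb


def purity(true_labels, cluster_ids):
    """Share of documents that fall in the majority class of their cluster."""
    classes_per_cluster = {}
    for label, cluster in zip(true_labels, cluster_ids):
        classes_per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in classes_per_cluster.values())
    return majority_total / len(true_labels)


def adjusted_rand_index(true_labels, cluster_ids):
    """Chance-corrected agreement between two partitions (pair counting)."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # assumption: treat a trivial input as perfect agreement
    # Number of same-group pairs in the contingency cells and in each marginal.
    sum_cells = sum(comb(v, 2) for v in Counter(zip(true_labels, cluster_ids)).values())
    sum_classes = sum(comb(v, 2) for v in Counter(true_labels).values())
    sum_clusters = sum(comb(v, 2) for v in Counter(cluster_ids).values())
    expected = sum_classes * sum_clusters / comb(n, 2)
    max_index = (sum_classes + sum_clusters) / 2
    if max_index == expected:
        return 1.0  # both partitions are trivial and identical in structure
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy data: six documents, three true topics, three clusters.
true_topics = ["sports", "sports", "sports", "politics", "politics", "tech"]
clusters = [0, 0, 1, 1, 1, 2]

print(f"Purity: {purity(true_topics, clusters):.3f}")              # 0.833
print(f"ARI:    {adjusted_rand_index(true_topics, clusters):.3f}")  # 0.318
```

Purity rewards clusters dominated by a single class but can be inflated by producing many small clusters, which is one reason it is usually reported together with a chance-corrected measure such as the Adjusted Rand Index.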

Updated: 2021-09-08
