An effective dimension reduction algorithm for clustering Arabic text
Egyptian Informatics Journal (IF 5.0), Pub Date: 2019-05-31, DOI: 10.1016/j.eij.2019.05.002
A.A. Mohamed

Text clustering is a challenging task in natural language processing because representing documents as term vectors produces a very high-dimensional space (the curse of dimensionality). Texts also contain considerable ambiguity and redundancy, which introduce noise into the representation. An efficient and accurate clustering algorithm therefore needs to extract the main concepts of the text by eliminating noise and reducing the dimensionality of the data. This paper compares three well-known dimension reduction algorithms for text clustering, namely Principal Component Analysis (PCA), Nonnegative Matrix Factorization (NMF) and Singular Value Decomposition (SVD), to show the pros and cons of each. It presents an effective dimension reduction algorithm for Arabic text clustering using PCA. To that end, a series of experiments was conducted using two linguistic corpora covering English and Arabic, and the results were analyzed from a clustering-quality point of view. The experiments show that PCA improves the quality of the clustering, gives more interpretable results, and reduces the time needed for clustering, for both Arabic and English documents.
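To make the idea concrete, the following is a minimal sketch of reducing a term-weighted document matrix with PCA before clustering. It is not the paper's implementation: the toy English documents, the whitespace tokenizer, the TF-IDF weighting, the plain k-means routine and all function names (`tfidf`, `pca_reduce`, `kmeans`) are assumptions made for illustration, and real Arabic text would need language-specific preprocessing that is omitted here. PCA is computed as an SVD of the mean-centered matrix, which also shows how the two techniques named in the abstract relate.

```python
import numpy as np

def tfidf(docs):
    """Build a dense, L2-normalized TF-IDF matrix (documents x terms).

    Assumption: tokens are separated by whitespace; no stemming or
    stop-word removal is applied.
    """
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: j for j, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.split():
            tf[i, index[w]] += 1.0
    df = (tf > 0).sum(axis=0)            # document frequency of each term
    idf = np.log(len(docs) / df) + 1.0   # smoothed inverse document frequency
    x = tf * idf
    return x / np.linalg.norm(x, axis=1, keepdims=True), vocab

def pca_reduce(x, k):
    """Project the rows of x onto the top-k principal components."""
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def kmeans(x, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):      # keep the old center if a cluster empties
                centers[c] = x[labels == c].mean(axis=0)
    return labels

# Hypothetical toy corpus: three documents about sport, three about finance.
docs = [
    "team wins football match",
    "football team scores goal",
    "goal decides football match",
    "bank raises interest rate",
    "interest rate hits bank profit",
    "bank profit and market rate",
]

matrix, vocab = tfidf(docs)
reduced = pca_reduce(matrix, k=2)   # from len(vocab) dimensions down to 2
labels = kmeans(reduced, k=2)
print(matrix.shape, "->", reduced.shape)
print(labels)
```

Clustering then runs on two columns instead of one column per vocabulary term, which is where the time savings described in the abstract would come from on a realistic corpus. Skipping the centering step in `pca_reduce` would turn it into a truncated SVD of the raw matrix, the variant usually used for latent semantic analysis.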



Updated: 2019-05-31