Effect of stemming on text similarity for Arabic language at sentence level, PeerJ Computer Science



PeerJ Computer Science (IF 3.5), Pub Date: 2021-05-14, DOI: 10.7717/peerj-cs.530
Mohammad O. Alhawarat ₁, Hikmat Abdeljaber ₁, Anwer Hilal ₂


Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at the sentence level. Ten algorithms are used in this study, comprising several Arabic light and heavy stemmers as well as lemmatization algorithms. Standard training and testing data sets are taken from the SemEval-2017 international workshop, Task 1, Track 1 Arabic (ar–ar). Different features are selected to study the effect of stemming on text similarity based on different similarity measures. Traditional machine learning algorithms are used, such as Support Vector Machines (SVM), Stochastic Gradient Descent (SGD), and Naïve Bayesian (NB). Compared to the original text, using the stemmed and lemmatized documents in experiments achieves enhanced Pearson correlation results. The best results were attained when using the Arabic Light Stemmer (ARLSTem) and Farasa light stemmers, the Farasa and Qalsadi lemmatizers, and the Tashaphyne heavy stemmer. The best enhancement was about 7.34% in Pearson correlation. In general, stemming considerably improves the performance of sentence-level text similarity for the Arabic language. However, some stemmers make results worse than those for the original text, namely the Khoja heavy stemmer and the AlKhalil light stemmer.
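
To make the idea concrete, the following is a minimal sketch of why stemming can change a sentence-level similarity score, and of how Pearson correlation between predicted and gold scores is computed. It is not the authors' pipeline: the affix lists and the functions `toy_light_stem`, `jaccard`, and `pearson` are hypothetical illustrations, the stemmer is far simpler than ARLSTem or Farasa, and the study feeds several similarity features into learned models (SVM, SGD, NB) rather than using a single overlap score.

```python
import math

# Hypothetical, deliberately tiny affix lists (an assumption for illustration);
# real light stemmers such as ARLSTem or Farasa use much richer rules.
PREFIXES = ("ال", "و")
SUFFIXES = ("ات", "ون", "ين", "ة")


def toy_light_stem(token):
    """Strip at most one listed prefix and one listed suffix,
    keeping at least three characters of the word."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[:-len(suffix)]
            break
    return token


def jaccard(sentence_a, sentence_b, stem=None):
    """Jaccard overlap of whitespace tokens, optionally after stemming."""
    tokens_a = sentence_a.split()
    tokens_b = sentence_b.split()
    if stem is not None:
        tokens_a = [stem(t) for t in tokens_a]
        tokens_b = [stem(t) for t in tokens_b]
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("Pearson correlation is undefined for constant input")
    return cov / (std_x * std_y)


if __name__ == "__main__":
    a, b = "الكتاب", "كتاب"  # "the book" vs. "book"
    print(jaccard(a, b))                       # surface forms differ
    print(jaccard(a, b, stem=toy_light_stem))  # stems coincide
```

In this sketch the definite article hides the match between the two surface forms, while the stemmed forms coincide. A system would be evaluated by passing its predicted similarity scores and the gold scores of the test set to `pearson`.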


Updated: 2021-05-14
