AraCust: a Saudi Telecom Tweets corpus for sentiment analysis
PeerJ Computer Science (IF 3.8), Pub Date: 2021-05-20, DOI: 10.7717/peerj-cs.510
Latifah Almuqren 1, 2 , Alexandra Cristea 1

Compared with other languages, Arabic lacks large corpora for Natural Language Processing (Assiri, Emam & Al-Dossari, 2018; Gamal et al., 2019). A number of scholars have relied on translation from one language to another to construct their corpora (Rushdi-Saleh et al., 2011). This paper presents how we constructed, cleaned, pre-processed, and annotated our 20,000-tweet Gold Standard Corpus (GSC) AraCust, the first Telecom GSC for Arabic Sentiment Analysis (ASA) for Dialectal Arabic (DA). AraCust contains Saudi dialect tweets, processed from a self-collected Arabic tweets dataset, and has been annotated for sentiment analysis, i.e., manually labelled (k = 0.60). In addition, we illustrate AraCust's power by performing an exploratory data analysis of the features that stem from the nature of our corpus, to assist with choosing the right ASA methods for it. To evaluate our Gold Standard Corpus AraCust, we first ran a simple experiment, using a supervised classifier, to offer benchmark results for future work. We also applied the same supervised classifier to a publicly available Arabic dataset created from Twitter, ASTD (Nabil, Aly & Atiya, 2015). The results show that the classifier performs better on AraCust than on ASTD, reaching 91% accuracy and an 89% F1avg score. The AraCust corpus will be released via GitHub, together with code useful for its exploration, as part of this submission.
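
The abstract reports three figures: an inter-annotator agreement (k), an accuracy, and an averaged F1 score. As a rough illustration of how such figures are typically computed, the sketch below uses toy labels, not AraCust data. It assumes that k denotes Cohen's kappa between two annotators and that F1avg is an unweighted average of per-class F1 scores; the abstract confirms neither, and all function names are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' label lists of equal length."""
    n = len(labels_a)
    # Observed agreement: share of items both annotators labelled the same.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def accuracy(gold, pred):
    """Share of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_for_label(gold, pred, label):
    """F1 score for a single class, treating it as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(gold, pred):
    """Unweighted mean of per-class F1 (assumed meaning of F1avg)."""
    labels = sorted(set(gold))
    return sum(f1_for_label(gold, pred, l) for l in labels) / len(labels)


# Toy labels for illustration only; these are not taken from AraCust.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]

print("kappa:", cohen_kappa(annotator_1, annotator_2))
print("accuracy:", accuracy(annotator_1, annotator_2))
print("average F1:", average_f1(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually lower than the plain percentage of matching labels.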

Updated: 2021-05-20