Towards Authorship Attribution in Arabic Short-Microblog Text
IEEE Access (IF 3.9), Pub Date: 2021-09-13, DOI: 10.1109/access.2021.3112624
Kamal Mansour Jambi, Imtiaz Hussain Khan, Muazzam Ahmed Siddiqui, Salma Omar Alhaj

Authorship attribution is the task of identifying individuals by their writing style without knowing their actual identities, and it is a challenging problem in natural language processing. Most work on authorship attribution has focused on English, whereas the problem is understudied in Arabic. Moreover, because of the complex and distinct morphology of Arabic, techniques developed for English are not directly applicable to it. This paper explores the possibility of using state-of-the-art classifiers, namely Support Vector Machines (SVM), K-Nearest Neighbours (KNN) and Random Forest, to predict authorship in Arabic short-microblog text. We employed three commonly used types of linguistic features (character-based, lexical and syntactic) in an incremental manner to evaluate the accuracy of the selected classifiers. The results show that a systematic combination of linguistic features improves authorship classification. However, classification accuracy was inversely correlated with the number of authors. Overall, the SVM and Random Forest classifiers performed comparably, attaining ~65% accuracy, whereas KNN attained only ~35%. In addition, lexical features offer more discriminatory power than character and syntactic features.
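
The incremental feature combination described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes character bigrams as the character-based features, whitespace tokens as the lexical features, and a plain 1-nearest-neighbour rule with cosine similarity in place of the paper's classifiers. Syntactic features are omitted because they would need a part-of-speech tagger. All function names and the toy posts (in English, for readability) are hypothetical.

```python
from collections import Counter
import math


def char_ngrams(text, n=2):
    """Character-based features: counts of overlapping character n-grams."""
    return Counter("c:" + text[i:i + n] for i in range(len(text) - n + 1))


def lexical(text):
    """Lexical features: counts of whitespace-separated tokens."""
    return Counter("w:" + tok for tok in text.split())


def featurize(text, extractors):
    """Merge the counts from every extractor into one sparse feature vector."""
    vec = Counter()
    for extract in extractors:
        vec.update(extract(text))
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_author(text, training, extractors):
    """1-nearest-neighbour: return the author of the most similar training post."""
    query = featurize(text, extractors)
    best = max(training, key=lambda pair: cosine(query, featurize(pair[0], extractors)))
    return best[1]


# Invented toy posts and author labels, for illustration only.
TRAIN = [
    ("good morning everyone!!", "author_a"),
    ("good night everyone!!", "author_a"),
    ("match starts at nine tonight", "author_b"),
    ("the match was great tonight", "author_b"),
]

# Add feature families one at a time, mirroring the incremental setup.
for extractors in ([char_ngrams], [char_ngrams, lexical]):
    names = "+".join(f.__name__ for f in extractors)
    print(names, predict_author("good evening everyone!!", TRAIN, extractors))
```

Prefixing feature keys with `c:` and `w:` keeps the two feature families from colliding when they are merged, so each added family extends the vector rather than overwriting earlier counts.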

Updated: 2021-09-24