Gender identification of Egyptian dialect in Twitter
Egyptian Informatics Journal ( IF 5.2 ) Pub Date : 2019-01-24 , DOI: 10.1016/j.eij.2018.12.002
Shereen Hussein , Mona Farouk , ElSayed Hemayed

Despite the widespread use of social media among all age groups in Arab countries, research on Author Profiling (AP) is still in its early stages. This paper provides an Egyptian Dialect Gender Annotated Dataset (EDGAD) obtained from Twitter, together with a proposed text classification solution for the Gender Identification (GI) problem. The dataset consists of 70,000 tweets per gender. For text classification, a Mixed Feature Vector (MFV) combining stylometric features with features specific to the Egyptian Arabic Dialect (EAD) is proposed, in addition to an N-Gram Feature Vector (NFV). A weighted-average ensemble is applied to a Random Forest (RF) classifier trained on the MFV and a Logistic Regression (LR) classifier trained on the NFV. The achieved gender identification accuracy is 87.6%.
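The weighted-average ensemble step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each classifier outputs a probability per gender class; the function names, the label order, the example probabilities, and the weights 0.4 and 0.6 are all hypothetical and are not taken from the paper, which does not state its weights in the abstract.

```python
from typing import List, Sequence


def weighted_average_ensemble(
    p_rf: Sequence[float],
    p_lr: Sequence[float],
    w_rf: float = 0.5,
    w_lr: float = 0.5,
) -> List[float]:
    """Combine two classifiers' class probabilities by weighted average.

    p_rf: class probabilities from the Random Forest (trained on the MFV).
    p_lr: class probabilities from the Logistic Regression (trained on the NFV).
    w_rf, w_lr: hypothetical ensemble weights (assumed, not from the paper).
    """
    if len(p_rf) != len(p_lr):
        raise ValueError("both classifiers must score the same classes")
    total = w_rf + w_lr
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalise by the weight sum so the result is still a distribution.
    return [(w_rf * a + w_lr * b) / total for a, b in zip(p_rf, p_lr)]


def predict_label(
    probs: Sequence[float],
    labels: Sequence[str] = ("female", "male"),
) -> str:
    """Return the label with the highest combined probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best]


# Illustrative probabilities for one tweet, in the order (female, male).
rf_probs = [0.40, 0.60]  # the RF leans towards "male"
lr_probs = [0.70, 0.30]  # the LR leans towards "female"

combined = weighted_average_ensemble(rf_probs, lr_probs, w_rf=0.4, w_lr=0.6)
label = predict_label(combined)
```

With these made-up numbers the more heavily weighted LR outweighs the RF, so the combined prediction follows the LR. In practice the weights would be tuned on held-out data.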




Updated: 2019-01-24