Effective multi-dialectal Arabic POS tagging
Natural Language Engineering (IF 2.3), Pub Date: 2020-04-14, DOI: 10.1017/s1351324920000078
Kareem Darwish, Mohammed Attia, Hamdy Mubarak, Younes Samih, Ahmed Abdelali, Lluís Màrquez, Mohamed Eldesouki, Laura Kallmeyer

This work introduces robust multi-dialectal part-of-speech tagging trained on an annotated data set of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses conditional random fields (CRFs), while the second combines word- and character-based representations in a deep neural network that stacks convolutional and recurrent layers beneath a CRF output layer. We successfully exploit a variety of features that help our models generalize, such as Brown clusters and stem templates. We also develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per-dialect results ranging between 90.2% and 95.4%. We obtain these results using a train/dev/test split of 70/10/20 on a data set of 350 tweets per dialect.
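To make the data split concrete, the following is a minimal Python sketch of a 70/10/20 train/dev/test split applied to 350 items per dialect. The abstract does not say how the split was drawn (randomly or in order, per dialect or over the pooled data), so the per-dialect random shuffle, the seed, the function name `split_70_10_20`, and the placeholder integer IDs are all assumptions made for illustration, not the authors' procedure.

```python
import random

DIALECTS = ["Egyptian", "Levantine", "Gulf", "Maghrebi"]
TWEETS_PER_DIALECT = 350  # figure stated in the abstract


def split_70_10_20(items, seed=0):
    """Shuffle a copy of items and cut it into train/dev/test at 70%/10%/20%.

    Hypothetical helper: the shuffle and the seed are illustrative assumptions.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_dev = len(shuffled) * 10 // 100
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remaining 20%
    return train, dev, test


# Placeholder integer IDs stand in for the annotated tweets (not real data).
splits = {
    dialect: split_70_10_20(range(TWEETS_PER_DIALECT))
    for dialect in DIALECTS
}

# Number of tweets in each partition, per dialect.
sizes = {d: tuple(len(part) for part in parts) for d, parts in splits.items()}

for dialect, (n_train, n_dev, n_test) in sizes.items():
    print(f"{dialect}: train={n_train} dev={n_dev} test={n_test}")
```

Under these assumptions each dialect contributes 245 training, 35 development, and 70 test tweets, so a joint model trained on all four dialects would see 980 training tweets in total.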

Updated: 2020-04-14