Part-of-speech tagging for Arabic tweets using CRF and Bi-LSTM, Computer Speech & Language


Computer Speech & Language (IF 4.3), Pub Date: 2020-07-31, DOI: 10.1016/j.csl.2020.101138
Wasan AlKhwiter, Nora Al-Twairesh

Over the past few years, Twitter has experienced massive growth, and the volume of its online content has increased rapidly. This content has been a rich source for studies in natural language processing (NLP). However, Twitter data pose numerous challenges to NLP tasks. For the English language, Twitter has an NLP tool that supports tweet-specific NLP tasks, which presents significant opportunities for English NLP research and applications. Part-of-speech (POS) tagging for English tweets is one of the tasks that such a tool offers. In contrast, only a few attempts have been made to develop POS taggers for Arabic content on Twitter. In this paper, we consider POS tagging, an NLP task that directly affects the performance of subsequent text processing tasks. We introduce three manually annotated datasets for the POS tagging of Arabic tweets: the ‘Mixed,’ ‘MSA,’ and ‘GLF’ datasets, with 3000, 1000, and 1000 Arabic tweets, respectively. In addition, we present an exploratory analysis of how hashtags are used in Arabic tweets, a phenomenon that affects the task of POS tagging. We also present two supervised POS taggers based on two approaches: Conditional Random Fields (CRF) and Bidirectional Long Short-Term Memory (Bi-LSTM) models. We conclude that the Bi-LSTM-based POS tagger achieves state-of-the-art results for the ‘Mixed’ dataset, with 96.5% accuracy. By comparison, the dialect-specific taggers trained on the ‘MSA’ and ‘GLF’ datasets achieve accuracies of 95.6% and 95%, respectively. The results for the ‘Mixed’ dataset indicate the effectiveness of developing a joint POS tagger without the need for a dialect-specific POS tagger.
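As a rough illustration of how a CRF-style tagger might see a tweet, the sketch below turns each token into a dictionary of hand-crafted features, including flags for hashtags, mentions, and URLs, and computes token-level accuracy, the kind of metric reported above. The feature set, the function names (`token_features`, `tweet_features`, `token_accuracy`), and the example tokens are all assumptions made for illustration; the abstract does not describe the features or the evaluation code actually used.

```python
import re

# Assumption: a simple pattern for URLs as they typically appear in tweets.
URL_RE = re.compile(r"^https?://\S+$")


def token_features(tokens, i):
    """Build a feature dict for tokens[i] (hypothetical feature set)."""
    tok = tokens[i]
    feats = {
        "bias": 1.0,
        "token": tok,
        "prefix2": tok[:2],            # short prefix, e.g. a clitic or article
        "suffix2": tok[-2:],           # short suffix
        "is_hashtag": tok.startswith("#"),
        "is_mention": tok.startswith("@"),
        "is_url": bool(URL_RE.match(tok)),
        "is_digit": tok.isdigit(),
    }
    # Neighbouring tokens give the model local context.
    feats["prev_token"] = tokens[i - 1] if i > 0 else "<BOS>"
    feats["next_token"] = tokens[i + 1] if i < len(tokens) - 1 else "<EOS>"
    return feats


def tweet_features(tokens):
    """Feature dicts for every token of one tokenized tweet."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    total = sum(len(tags) for tags in gold)
    correct = sum(
        g == p
        for gold_tags, pred_tags in zip(gold, pred)
        for g, p in zip(gold_tags, pred_tags)
    )
    return correct / total if total else 0.0


if __name__ == "__main__":
    tweet = ["@user", "#الرياض", "جميلة", "https://t.co/x"]
    for feats in tweet_features(tweet):
        print(feats["token"], feats["is_hashtag"], feats["is_mention"], feats["is_url"])
```

Per-token feature dictionaries of this shape are a common input format for CRF toolkits; a Bi-LSTM tagger would instead typically learn its representations from word or character embeddings rather than from hand-written features.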


Updated: 2020-08-06
