Part-of-speech tagging for Arabic tweets using CRF and Bi-LSTM

doi:10.1016/j.csl.2020.101138

Computer Speech & Language

Volume 65, January 2021, 101138

https://doi.org/10.1016/j.csl.2020.101138

Highlights

• POS taggers are developed for MSA and GLF variants of the Arabic language using CRF and BiLSTM.
• The gold standard annotated datasets that have been constructed for POS tagging are made accessible to the research community.
• An exploratory analysis of the behavior of using hashtags in Arabic tweets is presented, and this can be leveraged in future studies.
• The POS tagger for Arabic tweets using the BiLSTM achieves the best performance.
• Experiments show that there is no need for a dialect-specific POS tagger.

Abstract

Over the past few years, Twitter has experienced massive growth and the volume of its online content has increased rapidly. This content has been a rich source for several studies in natural language processing (NLP). However, Twitter data pose numerous challenges to NLP tasks. For the English language, a Twitter NLP tool supports tweet-specific NLP tasks, which presents significant opportunities for English NLP research and applications. Part-of-speech (POS) tagging for English tweets is one of the tasks that such a tool offers. In contrast, only a few attempts have been made to develop POS taggers for Arabic content on Twitter. In this paper, we consider POS tagging, an NLP task that directly affects the performance of subsequent text processing tasks. We introduce three manually annotated datasets for the POS tagging of Arabic tweets: the ‘Mixed,’ ‘MSA,’ and ‘GLF’ datasets with 3000, 1000, and 1000 Arabic tweets, respectively. In addition, we present an exploratory analysis of hashtag usage in Arabic tweets, a phenomenon that affects the task of POS tagging. We also present two supervised POS taggers based on two approaches: Conditional Random Fields (CRF) and Bidirectional Long Short-Term Memory (Bi-LSTM) models. We conclude that the Bi-LSTM-based POS tagger achieves state-of-the-art results for the ‘Mixed’ dataset with 96.5% accuracy, whereas the dialect-specific taggers trained on the ‘MSA’ and ‘GLF’ datasets achieve accuracies of 95.6% and 95%, respectively. The results for the ‘Mixed’ dataset indicate the effectiveness of developing a joint POS tagger without the need for a dialect-specific POS tagger.

Introduction

Natural language processing (NLP) involves several tasks and applications. Part-of-speech (POS) tagging is one of the first processes in NLP applications, and it directly affects the performance of subsequent text processing tasks (Albared et al., 2011). The performance of most NLP tasks and applications depends on the genre of the text being processed. The popular microblogging service Twitter has grown significantly in the last few years, and its users post millions of messages. Thus, Twitter is a rich source of data for studying the evolution of various issues. However, Twitter data pose numerous challenges due to the nature of text in microblogs, such as the restricted length, which leads to a substantial number of abbreviations, and the noisy and informal content. In addition, grammar and correct spelling are usually not adhered to on Twitter (Farghaly and Shaalan, 2009). Hence, processing Twitter data differs from processing other genres of text.

Twitter-based POS taggers and NLP tools provide POS tagging for the English language, and this presents significant opportunities for English NLP research and applications. In contrast, the lack of Twitter-based POS taggers for Arabic is a clear result of the lack of Arabic annotated datasets for POS tagging. To date, only a few studies have investigated this problem and developed POS taggers for Arabic tweets (Al-Sabbagh and Girju, 2012; Albogamy and Ramasy, 2015; Darwish et al., 2018; Alharbi et al., 2018). Furthermore, although POS tagging has been addressed using different approaches in the literature, deep learning-based studies are still relatively scarce. Hence, this paper explores the use of deep learning approaches for this task.

The Arabic language belongs to the Semitic language family, and is the official language of more than twenty countries in Africa and the Middle East. It is considered to be the fourth most used language on the web.¹ The Arabic language has three variants: Classical Arabic (CA), Modern Standard Arabic (MSA), and Colloquial or Dialectal Arabic (DA). CA was the language used in ancient times, and MSA is the primary written language used today by the media and in education. DA, however, is the everyday spoken language, and it exists in different varieties according to the country or Arab region the speaker is from (El-Beltagy and Ali, 2013). Therefore, DA varies geographically and socially and is not standardized (Zitouni, 2016). Dialects differ from MSA phonologically, morphologically, and syntactically (Habash, 2010). Moreover, dialects do not have standard orthographies. This makes building morphological analyzers and POS taggers for dialects a major challenge, and these varieties of Arabic therefore require advanced text processing. Until recently, DA was mostly spoken and rarely found in written form. The proliferation of social media has changed this trend, as Arab users tend to use DA in these new venues. Hence, DA is now also found in written form. This paper focuses on one of the Arabic dialects, namely Gulf (GLF). According to Habash (2010), GLF Arabic includes the dialects of Kuwait, United Arab Emirates, Bahrain, Qatar, and Saudi Arabia.

In this paper, we aim to build POS taggers for tweets written in both MSA and the GLF dialect. We present two tagging models based on Conditional Random Fields (CRFs) and Bidirectional Long Short-Term Memory (Bi-LSTM). To train these models, we constructed three datasets: ‘Mixed,’ ‘MSA,’ and ‘GLF’ with 3000, 1000, and 1000 Arabic tweets, respectively. A series of experiments was conducted to evaluate the trained models. We also investigated the effect of features derived from the morphological analyzer MADAMIRA (Pasha et al., 2014) on tagging performance. To the best of our knowledge, the ‘Mixed’ dataset of 3000 Arabic tweets is, to date, the largest annotated dataset extracted from Twitter for POS tagging.

The contributions of this paper are as follows:

• We present a POS tagger for Arabic tweets using a deep learning approach that achieves state-of-the-art performance.
• POS taggers are developed for MSA and GLF variants of the Arabic language.
• The gold standard annotated datasets that have been constructed for POS tagging are made accessible to the research community.
• We present an exploratory analysis of the behavior of using hashtags in Arabic tweets, and this can be leveraged in future studies.

The paper is structured as follows. Section 2 presents related work. Section 3 introduces the dataset for POS tagging for Arabic tweets. Section 4 presents hashtag analysis. Section 5 demonstrates the adopted features. Section 6 presents the tagging methods. Section 7 describes experiments. Section 8 discusses the obtained results and presents error analysis. Finally, Section 9 concludes the paper and discusses future work.

Section snippets

Related work

POS tagging has been a well-studied problem in NLP over the past decades. Several studies have developed POS taggers tailored for social media text. Gimpel et al. (2011) presented one of the earliest POS tagging methods for English tweets, which is included in a web-based CMU Twitter NLP toolkit (ArkNLP). They developed a tagset of 25 tags and used it to annotate a corpus consisting of 1827 tweets (26,436 tokens). The corpus was divided into training/development/test sets of

Datasets

This section presents the datasets that have been constructed for the POS tagging of Arabic tweets. We first describe the annotation process, including how the data was collected and annotated, and present the adopted tagset in our POS tagging process. We then present statistical information on the datasets.

Hashtag analysis

Twitter users can use hashtags in different ways. In fact, the use of hashtags differs across cultures. For example, in English, underscores are not used, and hashtags with more than one word are written by capitalizing each word, such as #FirstSecond. In contrast, in Arabic, the underscores are inserted to delimit the parts of the word. It was observed that there were many hashtags in our dataset that are used as a part of the text in tweets and this affected their suitability to be tagged as
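The two hashtag conventions described above can be illustrated with a minimal sketch. It is not the procedure used in the paper: the function name `split_hashtag`, the capitalization regular expression, and the Arabic example hashtag are assumptions introduced here for illustration only.

```python
import re
from typing import List


def split_hashtag(tag: str) -> List[str]:
    """Split a hashtag into its component words (illustrative heuristic)."""
    body = tag.lstrip("#")
    if "_" in body:
        # Underscore convention (described above for Arabic hashtags).
        return [part for part in body.split("_") if part]
    # Capitalization convention (described above for English hashtags):
    # a capitalized word, a run of capitals, or a run of digits.
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", body)
    # Fall back to the whole body when no boundary is found,
    # e.g. a single-word Arabic hashtag.
    return parts or [body]


english_parts = split_hashtag("#FirstSecond")
arabic_parts = split_hashtag("#اليوم_الوطني")  # hypothetical example hashtag
```

Once a hashtag is split in this way, its parts could be treated like ordinary tokens when the hashtag functions as part of the tweet text; whether that treatment is appropriate depends on the annotation guidelines, which this sketch does not reproduce.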

Features

Because the MADARi annotation interface (Obeid et al., 2018), which runs the morphological analyzer MADAMIRA (Pasha et al., 2014) at its core, was used, the form (POS.Features) was obtained from MADARi through automatic tagging. These features refer to specific morpho-syntactic aspects of the word: aspect, person, gender, and number. Each feature has several values, as follows:

Aspect: with the values Perfective (P), Imperfective (I) and Command (C).
Person: with the values 1st (1),

CRFs

Conditional Random Fields (CRFs) (Lafferty et al., 2001) have been shown to achieve state-of-the-art performance in several sequence labeling tasks. CRFs estimate the probabilities of possible label sequences for a given observation sequence. A previous study has shown that a CRF-based tagger developed for English tweets achieved high accuracy (Gimpel et al., 2011). Hence, we used this sequence labeling method on our gold standard dataset which is much larger than the corpus of the

Experimental setting

We divided each of the three datasets, namely ‘Mixed,’ ‘MSA,’ and ‘GLF,’ into train, development, and test sets using an 80/10/10 split. The test set consisted of tweets that were not used to build the model, while the train and development sets were used to tune the classifier's parameters to the values that gave the best performance.
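An 80/10/10 split of this kind can be sketched as follows. This is only an illustration under stated assumptions: the split is assumed to be made at the tweet level after a seeded shuffle, which the text above does not specify, and the function name `split_80_10_10` and the placeholder data are hypothetical.

```python
import random


def split_80_10_10(tweets, seed=0):
    """Shuffle the tweets and return (train, dev, test) in an 80/10/10 ratio."""
    items = list(tweets)
    random.Random(seed).shuffle(items)  # seeded for reproducibility (assumption)
    n_train = len(items) * 8 // 10      # integer arithmetic avoids float rounding
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]      # remainder, never seen during tuning
    return train, dev, test


# Placeholder IDs standing in for the 3000 tweets of the 'Mixed' dataset.
mixed = range(3000)
train, dev, test = split_80_10_10(mixed)
sizes = (len(train), len(dev), len(test))
```

With 3000 tweets, this yields 2400, 300, and 300 tweets; with 1000 tweets, it yields 800, 100, and 100.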

Our evaluation tested the efficiency of the proposed models for the task of POS tagging. We have experimentally evaluated the performance of the two

CRF model results

In this set of experiments, the effectiveness of the CRF approach was evaluated for POS tagging on the development sets of the three annotated datasets, namely ‘Mixed,’ ‘MSA,’ and ‘GLF’

Conclusion

We have introduced new datasets for POS tagging that are constructed from Arabic tweets. A supervised approach is used to train two different models, namely CRF and Bi-LSTM, on these annotated datasets. It is shown that the proposed Bi-LSTM-based POS tagger achieves state-of-the-art results over the existing dialect-specific models with 96.5% accuracy on a ‘Mixed’ dataset of 3000 tweets. However, the dialect-specific taggers for MSA and Gulf achieve an accuracy of 95.6% and 95%, respectively

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We would like to thank Nizar Habash, Salam Khalifa, and Osama Obeid for providing us with access to MADARi for use in the annotation process. This research project was supported by a grant from the “Research Center of the Female Scientific and Medical Colleges,” Deanship of Scientific Research, King Saud University. The authors thank the Deanship of Scientific Research and RSSU at King Saud University for their technical support.

References (32)

R. Al-Sabbagh et al.
A Supervised POS Tagger for Written Arabic Social Networking Corpora
N. Al-Twairesh et al.
Arabic Spam Detection in Twitter
M. Albared et al.
Developing a Competitive HMM Arabic POS Tagger Using Small Training Corpora
F. Albogamy et al.
Towards POS Tagging for Arabic Tweets
F. Albogamy et al.
Fast and Robust POS Tagger for Arabic Tweets Using Agreement-based Bootstrapping
F. Albogamy et al.
POS Tagging for Arabic Tweets
R. Alharbi et al.
Part-of-Speech Tagging for Arabic Gulf Dialect Using Bi-LSTM
CoNLL-U Format [WWW Document], n.d. URL https://universaldependencies.org/format.html (accessed...
CRF++: Yet Another CRF toolkit [WWW Document], n.d. URL https://taku910.github.io/crfpp/ (accessed...
K. Darwish et al.
Multi-Dialect Arabic POS Tagging: A CRF Approach
L. Derczynski et al.
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
M.T. Diab
Second Generation AMIRA Tools for Arabic Processing: Fast and Robust Tokenization, POS Tagging, and Base Phrase Chunking
S. El-Beltagy et al.
Open Issues in the Sentiment Analysis of Arabic Social Media: A Case Study
A. Farghaly et al.
Arabic Natural Language Processing: Challenges and Solutions
R. Garside et al.
Corpus Annotation: Linguistic Information from Computer Text Corpora (1997)
K. Gimpel et al.
Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
