Part-of-speech tagging for Arabic tweets using CRF and Bi-LSTM

https://doi.org/10.1016/j.csl.2020.101138

Highlights

  • POS taggers are developed for MSA and GLF variants of the Arabic language using CRF and BiLSTM.

  • The gold standard annotated datasets that have been constructed for POS tagging are made accessible to the research community.

  • An exploratory analysis of the behavior of using hashtags in Arabic tweets is presented, and this can be leveraged in future studies.

  • The POS tagger for Arabic tweets using the BiLSTM achieves the best performance.

  • Experiments show that there is no need for a dialect specific POS tagger.

Abstract

Over the past few years, Twitter has experienced massive growth, and the volume of its online content has increased rapidly. This content has been a rich source for many studies in natural language processing (NLP). However, Twitter data pose numerous challenges for NLP tasks. For the English language, a Twitter NLP tool supports tweet-specific NLP tasks, which presents significant opportunities for English NLP research and applications. Part-of-speech (POS) tagging for English tweets is one of the tasks that such a tool offers. In contrast, only a few attempts have been made to develop POS taggers for Arabic content on Twitter. In this paper, we consider POS tagging, an NLP task that directly affects the performance of subsequent text processing tasks. We introduce three manually annotated datasets for the POS tagging of Arabic tweets: the ‘Mixed,’ ‘MSA,’ and ‘GLF’ datasets with 3000, 1000, and 1000 Arabic tweets, respectively. In addition, we present an exploratory analysis of hashtag usage in Arabic tweets, a phenomenon that affects the task of POS tagging. We also present two supervised POS taggers based on two approaches: Conditional Random Fields (CRFs) and Bidirectional Long Short-Term Memory (Bi-LSTM) models. We conclude that the Bi-LSTM-based POS tagger achieves state-of-the-art results for the ‘Mixed’ dataset with 96.5% accuracy. By comparison, the dialect-specific taggers trained on the ‘MSA’ and ‘GLF’ datasets achieve accuracies of 95.6% and 95%, respectively. The results for the ‘Mixed’ dataset indicate the effectiveness of developing a joint POS tagger without the need for a dialect-specific POS tagger.

Introduction

Natural language processing (NLP) involves several tasks and applications. Part-of-speech (POS) tagging is one of the first processes, and it directly affects the performance of subsequent text processing tasks in NLP applications (Albared et al., 2011). The performance of most NLP tasks and applications depends on the genre of the text being processed. The popular microblogging service Twitter has grown significantly in the last few years, and its users post millions of messages. Thus, Twitter is a rich and fruitful source of data for studying the evolution of various issues. However, Twitter data pose numerous challenges due to the nature of text in microblogs, such as the restricted length, which leads to a substantial number of abbreviations, and noisy and informal content. In addition, grammar and correct spelling are usually not adhered to on Twitter (Farghaly and Shaalan, 2009). Hence, processing Twitter data differs from processing other genres of text.

Twitter-based POS taggers and NLP tools provide POS tagging for the English language, and this presents significant opportunities for English NLP research and applications. In contrast, the lack of Twitter-based POS taggers for Arabic is a clear result of the lack of Arabic annotated datasets for POS tagging. To date, only a few studies have investigated this problem and developed POS taggers for Arabic tweets (Al-Sabbagh and Girju, 2012; Albogamy and Ramsay, 2015; Darwish et al., 2018; Alharbi et al., 2018). Furthermore, although POS tagging has been addressed using different approaches in the literature, deep learning-based studies are still relatively scarce. Hence, this paper explores the use of deep learning approaches for this task.

The Arabic language belongs to the Semitic language family and is the official language of more than twenty countries in Africa and the Middle East. It is considered to be the fourth most used language on the web. The Arabic language has three variants: Classical Arabic (CA), Modern Standard Arabic (MSA), and Colloquial or Dialectal Arabic (DA). CA was the language used in ancient times, and MSA is the primary written language used today by the media and in education. DA, however, is the everyday spoken language that exists in different varieties according to the country or Arab region the speaker is from. DA involves different dialects based on geographical locations in Arab countries (El-Beltagy and Ali, 2013). Therefore, DA varies geographically and socially and is not standardized (Zitouni, 2016). Dialects differ from MSA phonologically, morphologically, and syntactically (Habash, 2010). Moreover, dialects do not have standard orthographies. This makes building morphological analyzers and POS taggers for dialects a major challenge. Hence, these varieties of the Arabic language require advanced processing of Arabic text. Until recently, DA was mostly spoken and was rarely found in written form. The proliferation of social media has changed this trend, as Arab users tend to use DA in these new venues. Hence, DA is now also found in written form. This paper focuses on one of the Arabic dialects, namely Gulf (GLF). According to Habash (2010), GLF Arabic includes the dialects of Kuwait, United Arab Emirates, Bahrain, Qatar, and Saudi Arabia.

In this paper, we aim to build POS taggers for tweets written in both MSA and the GLF dialect. We present two tagging models using Conditional Random Fields (CRFs) and Bidirectional Long Short-Term Memory (Bi-LSTM). To train these models, we have constructed three datasets: ‘Mixed,’ ‘MSA,’ and ‘GLF’ with 3000, 1000, and 1000 Arabic tweets, respectively. A series of experiments was conducted to evaluate our trained models. We also investigated the effect of features derived from the morphological analyzer MADAMIRA (Pasha et al., 2014) on the tagging performance. To the best of our knowledge, the ‘Mixed’ dataset of 3000 Arabic tweets is, to date, the largest annotated dataset extracted from Twitter for POS tagging.

The contributions of this paper are as follows:

  • We present a POS tagger for Arabic tweets using a deep learning approach that achieves a state-of-the-art performance.

  • POS taggers are developed for MSA and GLF variants of the Arabic language.

  • The gold standard annotated datasets that have been constructed for POS tagging are made accessible to the research community.

  • We present an exploratory analysis of the behavior of using hashtags in Arabic tweets, and this can be leveraged in future studies.

The paper is structured as follows. Section 2 presents related work. Section 3 introduces the dataset for POS tagging for Arabic tweets. Section 4 presents hashtag analysis. Section 5 demonstrates the adopted features. Section 6 presents the tagging methods. Section 7 describes experiments. Section 8 discusses the obtained results and presents error analysis. Finally, Section 9 concludes the paper and discusses future work.

Section snippets

Related work

POS tagging has been a well-studied problem in NLP over the past decades. Several studies have been conducted to develop POS taggers that are tailored for social media text. Gimpel et al. (2011) presented one of the preliminary POS tagging methods for English tweets included in a web-based CMU Twitter NLP toolkit (ArkNLP). They developed a tagset of 25 tags and used it to annotate a corpus consisting of 1827 tweets (26,436 tokens). The corpus was divided into training/development/test sets of

Datasets

This section presents the datasets that have been constructed for the POS tagging of Arabic tweets. We first describe the annotation process, including how the data was collected and annotated, and present the adopted tagset in our POS tagging process. We then present statistical information on the datasets.

Hashtag analysis

Twitter users can use hashtags in different ways. In fact, the use of hashtags differs across cultures. For example, in English, underscores are not used, and hashtags with more than one word are written by capitalizing each word, such as #FirstSecond. In contrast, in Arabic, the underscores are inserted to delimit the parts of the word. It was observed that there were many hashtags in our dataset that are used as a part of the text in tweets and this affected their suitability to be tagged as
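The two hashtag conventions described above (underscore-delimited parts in Arabic, capitalized words in English) can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the procedure used in the paper: the function name split_hashtag and the regular expression are hypothetical choices made for this example.

```python
import re
from typing import List


def split_hashtag(tag: str) -> List[str]:
    """Split a hashtag into its component words (hypothetical helper)."""
    body = tag.lstrip("#")
    if "_" in body:
        # Underscore-delimited style, as described for Arabic hashtags.
        return [part for part in body.split("_") if part]
    # Capitalized-word style, as described for English hashtags:
    # a word is an optional capital followed by lowercase letters,
    # a run of capitals, or a run of digits.
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", body)
    # Fall back to the whole body when nothing matches,
    # e.g. a single Arabic word with no underscore.
    return parts or [body]


print(split_hashtag("#FirstSecond"))
print(split_hashtag("#اليوم_الوطني"))
```

The first call separates the English-style hashtag into "First" and "Second"; the second separates the Arabic-style hashtag at the underscore.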

Features

Since the MADARi (Obeid et al., 2018) annotation interface, which runs the morphological analyzer MADAMIRA (Pasha et al., 2014) at its core, was used, the form (POS.Features) was obtained using MADARi through automatic tagging. These features refer to specific morpho-syntactic aspects of the word. They represent aspect, person, gender, and number. Each feature has several values as follows:

  • Aspect: with the values Perfective (P), Imperfective (I) and Command (C).

  • Person: with the values 1st (1),

CRFs

Conditional Random Fields (CRFs) (Lafferty et al., 2001) have proven to achieve state-of-the-art performance in several sequence labeling tasks. CRFs estimate the probabilities of possible label sequences for a given observation sequence. A previous study has shown that the CRF-based tagger that was developed for English tweets achieved high accuracy (Gimpel et al., 2011). Hence, we used this sequence labeling method on our gold standard dataset which is much larger than the corpus of the
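To make concrete what it means for a CRF to estimate the probability of a label sequence given an observation sequence, the following is a minimal sketch of a linear-chain CRF on toy data. Everything in it is an assumption made for illustration: the two-tag tagset, the emission and transition scores, and the brute-force normalization over all label sequences. It does not reflect the features or toolkit configuration used in the paper, and practical implementations compute the normalizer with the forward algorithm rather than by enumeration.

```python
import itertools
import math

TAGS = ["NOUN", "VERB"]  # hypothetical two-tag tagset for illustration


def sequence_score(tags, emission, transition):
    """Unnormalized log-score of one label sequence."""
    score = sum(emission[i][t] for i, t in enumerate(tags))
    score += sum(transition[(a, b)] for a, b in zip(tags, tags[1:]))
    return score


def sequence_probability(tags, emission, transition):
    """p(tags | observations): exp(score) divided by the sum over all sequences."""
    n = len(emission)
    partition = sum(
        math.exp(sequence_score(candidate, emission, transition))
        for candidate in itertools.product(TAGS, repeat=n)
    )
    return math.exp(sequence_score(tags, emission, transition)) / partition


# Toy scores for a two-token observation sequence (illustrative values only).
emission = [
    {"NOUN": 1.0, "VERB": 0.0},
    {"NOUN": 0.0, "VERB": 1.0},
]
transition = {
    ("NOUN", "NOUN"): 0.0,
    ("NOUN", "VERB"): 0.5,
    ("VERB", "NOUN"): 0.0,
    ("VERB", "VERB"): 0.0,
}

for candidate in itertools.product(TAGS, repeat=2):
    print(candidate, round(sequence_probability(candidate, emission, transition), 3))
```

The printed probabilities sum to one across the four candidate sequences, and the sequence with the highest combined emission and transition score receives the largest share.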

Experimental setting

We divided each of the three datasets, namely, ‘Mixed,’ ‘MSA,’ and ‘GLF,’ into train, development, and test sets using an 80/10/10 split. The test set consisted of tweets that were not used to set up the model, while the train and development sets were used to tune the classifier to the parameter values that gave the best performance.
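An 80/10/10 split of this kind can be sketched as follows. The function name split_dataset, the random shuffle, and the fixed seed are assumptions made for this example; the text does not state how tweets were assigned to each set.

```python
import random


def split_dataset(tweets, seed=0):
    """Split items into 80% train, 10% development, 10% test (hypothetical helper)."""
    items = list(tweets)
    # Shuffling with a fixed seed is an assumption; it keeps the split reproducible.
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding in the set sizes.
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


# Stand-in identifiers for a dataset of 3000 tweets.
train, dev, test = split_dataset(range(3000))
print(len(train), len(dev), len(test))
```

For a dataset of 3000 tweets this yields 2400, 300, and 300 items, and for a dataset of 1000 tweets it yields 800, 100, and 100.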

Our evaluation tested the efficiency of the proposed models for the task of POS tagging. We have experimentally evaluated the performance of the two

CRF model results

In this set of experiments, the effectiveness of the CRF approach was evaluated for POS tagging on the development sets of the three annotated datasets, namely ‘Mixed,’ ‘MSA,’ and ‘GLF’

Conclusion

We have introduced new datasets for POS tagging that are constructed from Arabic tweets. A supervised approach is used to train two different models, namely CRF and Bi-LSTM, on these annotated datasets. It is shown that the proposed Bi-LSTM-based POS tagger achieves state-of-the-art results over the existing dialect-specific models with 96.5% accuracy on a ‘Mixed’ dataset of 3000 tweets. However, the dialect-specific taggers for MSA and Gulf achieve accuracies of 95.6% and 95%, respectively

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We would like to thank Nizar Habash, Salam Khalifa, and Osama Obeid for providing us with access to MADARi for use in the annotation process. This research project was supported by a grant from the “Research Center of the Female Scientific and Medical Colleges,” Deanship of Scientific Research, King Saud University. The authors thank the Deanship of Scientific Research and RSSU at King Saud University for their technical support.

References (32)

  • R. Al-Sabbagh et al.

    A Supervised POS Tagger for Written Arabic Social Networking Corpora

  • N. Al-Twairesh et al.

    Arabic Spam Detection in Twitter

  • M. Albared et al.

    Developing a Competitive HMM Arabic POS Tagger Using Small Training Corpora

  • F. Albogamy et al.

    Towards POS Tagging for Arabic Tweets

  • F. Albogamy et al.

    Fast and Robust POS tagger for Arabic Tweets Using Agreement-based Bootstrapping

  • F. Albogamy et al.

    POS Tagging for Arabic Tweets

  • R. Alharbi et al.

    Part-of-Speech Tagging for Arabic Gulf Dialect Using Bi-LSTM

  • CoNLL-U Format [WWW Document], n.d. URL https://universaldependencies.org/format.html (accessed...
  • CRF++: Yet Another CRF toolkit [WWW Document], n.d. URL https://taku910.github.io/crfpp/ (accessed...
  • K. Darwish et al.

    Multi-Dialect Arabic POS Tagging: A CRF Approach

  • L. Derczynski et al.

    Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data

  • M.T. Diab

    Second Generation AMIRA Tools for Arabic Processing: Fast and Robust Tokenization, POS tagging, and Base Phrase Chunking

  • S. El-Beltagy et al.

    Open issues in the sentiment analysis of Arabic social media: a case study

  • A. Farghaly et al.

    Arabic Natural Language Processing: Challenges and Solutions

  • R. Garside et al.

    Corpus annotation: Linguistic Information from Computer Text Corpora

    (1997)
  • K. Gimpel et al.

    Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
