Part-of-speech tagging for Arabic tweets using CRF and Bi-LSTM
Introduction
Natural language processing (NLP) involves several tasks and applications. Part-of-speech (POS) tagging is one of the first processes, and it directly affects the performance of subsequent text processing tasks in NLP applications (Albared et al., 2011). The performance of most NLP tasks and applications depends on the genre of the text being processed. The popular microblogging service Twitter has grown significantly in the last few years, and it encourages people to post millions of messages. Thus, Twitter is a rich and fruitful source of data for studying the evolution of various issues. However, Twitter data pose numerous challenges due to the nature of text in microblogs, such as the restricted length, which leads to a substantial number of abbreviations, and noisy and informal content. In addition, grammar and correct spelling are often not adhered to on Twitter (Farghaly and Shaalan, 2009). Hence, processing Twitter data differs from processing other genres of text.
Twitter-based POS taggers and NLP tools provide POS tagging for the English language, and this presents significant opportunities for English NLP research and applications. In contrast, the lack of Twitter-based POS taggers for Arabic is a clear result of the lack of annotated Arabic datasets for POS tagging. To date, only a few studies have investigated this problem and developed POS taggers for Arabic tweets (Al-Sabbagh and Girju, 2012; Albogamy and Ramasy, 2015; Darwish et al., 2018; Alharbi et al., 2018). Furthermore, although the problem of POS tagging has been addressed using different approaches in the literature, deep learning-based studies are still relatively scarce. Hence, this paper explores the use of deep learning approaches for this task.
The Arabic language belongs to the Semitic language family and is the official language of more than twenty countries in Africa and the Middle East. It is considered to be the fourth most used language on the web. The Arabic language has three variants: Classical Arabic (CA), Modern Standard Arabic (MSA), and Colloquial or Dialectal Arabic (DA). CA was the language used in ancient times, and MSA is the primary written language used today by the media and in education. DA, however, is the everyday spoken language, which exists in different varieties according to the country or Arab region the speaker is from. DA involves different dialects based on geographical locations in Arab countries (El-Beltagy and Ali, 2013). Therefore, DA varies geographically and socially and is not standardized (Zitouni, 2016). Dialects differ from MSA phonologically, morphologically, and syntactically (Habash, 2010). Moreover, dialects do not have standard orthographies. This makes building morphological analyzers and POS taggers for dialects a major challenge. Hence, these varieties of the Arabic language require advanced processing for Arabic text. Until recently, DA was mostly spoken and was rarely found in written form. The proliferation of social media has changed this trend, as Arab users tend to use DA in these new venues. Hence, DA is now also found in written form. This paper focuses on one of the Arabic dialects, namely Gulf (GLF). According to Habash (2010), GLF Arabic includes the dialects of Kuwait, United Arab Emirates, Bahrain, Qatar, and Saudi Arabia.
In this paper, we aim to build POS taggers for tweets written in both MSA and the GLF dialect. We present two tagging models that use Conditional Random Fields (CRFs) and Bidirectional Long Short-Term Memory (Bi-LSTM). To train these models, we have constructed three datasets: ‘Mixed,’ ‘MSA,’ and ‘GLF’ with 3000, 1000, and 1000 Arabic tweets, respectively. A series of experiments was conducted to evaluate our trained models. We also investigated the effect of features derived from the morphological analyzer MADAMIRA (Pasha et al., 2014) on the tagging performance. To the best of our knowledge, the ‘Mixed’ dataset of 3000 Arabic tweets is, to date, the largest annotated dataset extracted from Twitter for POS tagging.
The contributions of this paper are as follows:
- We present a POS tagger for Arabic tweets using a deep learning approach that achieves state-of-the-art performance.
- POS taggers are developed for MSA and GLF variants of the Arabic language.
- The gold standard annotated datasets that have been constructed for POS tagging are made accessible to the research community.
- We present an exploratory analysis of the behavior of using hashtags in Arabic tweets, and this can be leveraged in future studies.
The paper is structured as follows. Section 2 presents related work. Section 3 introduces the dataset for POS tagging for Arabic tweets. Section 4 presents hashtag analysis. Section 5 demonstrates the adopted features. Section 6 presents the tagging methods. Section 7 describes experiments. Section 8 discusses the obtained results and presents error analysis. Finally, Section 9 concludes the paper and discusses future work.
Section snippets
Related work
POS tagging has been a well-studied problem in NLP over the past decades. Several studies have been conducted to develop POS taggers that are tailored for social media text. Gimpel et al. (2011) presented one of the preliminary POS tagging methods for English tweets, included in a web-based CMU Twitter NLP toolkit (ArkNLP). They developed a tagset of 25 tags and used it to annotate a corpus consisting of 1827 tweets (26,436 tokens). The corpus was divided into training/development/test sets of
Datasets
This section presents the datasets that have been constructed for the POS tagging of Arabic tweets. We first describe the annotation process, including how the data was collected and annotated, and present the adopted tagset in our POS tagging process. We then present statistical information on the datasets.
Hashtag analysis
Twitter users can use hashtags in different ways. In fact, the use of hashtags differs across cultures. For example, in English, underscores are not used, and hashtags with more than one word are written by capitalizing each word, such as #FirstSecond. In contrast, in Arabic, the underscores are inserted to delimit the parts of the word. It was observed that there were many hashtags in our dataset that are used as a part of the text in tweets and this affected their suitability to be tagged as
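The two multi-word hashtag conventions described above (CamelCase in English, underscores in Arabic) can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the authors' preprocessing: the function name `split_hashtag` and the splitting rules are hypothetical.

```python
import re


def split_hashtag(hashtag):
    """Split a hashtag into its component words (hypothetical helper).

    Assumption: underscores delimit words (the Arabic convention described
    in the text); otherwise a lowercase-to-uppercase boundary marks a new
    word (the English CamelCase convention).
    """
    body = hashtag.lstrip("#")
    if "_" in body:
        # Underscore-delimited hashtag: drop empty pieces from repeated "_".
        return [part for part in body.split("_") if part]
    # CamelCase hashtag: insert a space at each lowercase/uppercase boundary.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", body)
    return spaced.split()


print(split_hashtag("#FirstSecond"))
print(split_hashtag("#اليوم_الوطني"))
```

A hashtag with neither an underscore nor a case boundary is returned as a single word.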
Features
Because the MADARi annotation interface (Obeid et al., 2018), which runs the morphological analyzer MADAMIRA (Pasha et al., 2014) at its core, was used, the form (POS.Features) was obtained through automatic tagging in MADARi. These features refer to specific morpho-syntactic aspects of the word. They represent aspect, person, gender, and number. Each feature has several values as follows:
Aspect: with the values Perfective (P), Imperfective (I) and Command (C).
Person: with the values 1st (1),
CRFs
Conditional Random Fields (CRFs) (Lafferty et al., 2001) have been shown to achieve state-of-the-art performance in several sequence labeling tasks. CRFs estimate the probabilities of possible label sequences for a given observation sequence. A previous study has shown that the CRF-based tagger that was developed for English tweets achieved high accuracy (Gimpel et al., 2011). Hence, we used this sequence labeling method on our gold standard dataset which is much larger than the corpus of the
Experimental setting
We divided all three datasets, namely, ‘Mixed,’ ‘MSA,’ and ‘GLF’ into 80/10/10 train, development, and test sets, respectively. The test set consisted of tweets that were not used to set up the model, while the train and development sets were used to tune the classifier to the optimal value of the parameters that reported the best performance.
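As a minimal sketch of an 80/10/10 split like the one described above, the following Python function partitions a list of tweets into train, development, and test sets. The source does not state how tweets were assigned to each set, so the seeded random shuffle and the name `split_80_10_10` are assumptions made for illustration.

```python
import random


def split_80_10_10(tweets, seed=0):
    """Shuffle tweets and split them into train/dev/test (80/10/10).

    Assumption: a seeded random shuffle at the tweet level; the paper
    does not specify the assignment procedure.
    """
    items = list(tweets)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding in the set sizes.
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # remainder goes to the test set
    return train, dev, test


train, dev, test = split_80_10_10(range(3000))
print(len(train), len(dev), len(test))  # 2400 300 300
```

With this scheme, a 3000-tweet dataset yields 2400/300/300 tweets and a 1000-tweet dataset yields 800/100/100.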
Our evaluation tested the efficiency of the proposed models for the task of POS tagging. We have experimentally evaluated the performance of the two
CRF model results
In this set of experiments, the effectiveness of the CRF approach was evaluated for POS tagging on the development sets of the three annotated datasets, namely ‘Mixed,’ ‘MSA,’ and ‘GLF’
Conclusion
We have introduced new datasets for POS tagging that are constructed from Arabic tweets. A supervised approach is used to train two different models, namely CRF and Bi-LSTM, on these annotated datasets. It is shown that the proposed Bi-LSTM-based POS tagger achieves state-of-the-art results compared with the existing dialect-specific models, with 96.5% accuracy on a ‘Mixed’ dataset of 3000 tweets. However, the dialect-specific taggers for MSA and Gulf achieve an accuracy of 95.6% and 95%, respectively
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
We would like to thank Nizar Habash, Salam Khalifa, and Osama Obeid for providing us with access to MADARi for use in the annotation process. This research project was supported by a grant from the “Research Center of the Female Scientific and Medical Colleges,” Deanship of Scientific Research, King Saud University. The authors thank the Deanship of Scientific Research and RSSU at King Saud University for their technical support.
References (32)
- et al. A Supervised POS Tagger for Written Arabic Social Networking Corpora
- et al. Arabic Spam Detection in Twitter
- et al. Developing a Competitive HMM Arabic POS Tagger Using Small Training Corpora
- et al. Towards POS Tagging for Arabic Tweets
- et al. Fast and Robust POS Tagger for Arabic Tweets Using Agreement-based Bootstrapping
- et al. POS Tagging for Arabic Tweets
- et al. Part-of-Speech Tagging for Arabic Gulf Dialect Using Bi-LSTM
- CoNLL-U Format [WWW Document], n.d. URL https://universaldependencies.org/format.html (accessed...
- CRF++: Yet Another CRF toolkit [WWW Document], n.d. URL https://taku910.github.io/crfpp/ (accessed...
- et al. Multi-Dialect Arabic POS Tagging: A CRF Approach
- Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
- Second Generation AMIRA Tools for Arabic Processing: Fast and Robust Tokenization, POS Tagging, and Base Phrase Chunking
- Open Issues in the Sentiment Analysis of Arabic Social Media: A Case Study
- Arabic Natural Language Processing: Challenges and Solutions
- Corpus Annotation: Linguistic Information from Computer Text Corpora
- Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
Cited by (41)
An effective approach for identifying keywords as high-quality filters to get emergency-implicated Twitter Spanish data
2024, Computer Speech and Language
A novel method for signal labeling and precise location in a variable parameter milling process based on the stacked-BiLSTM-CRF and FLOSS
2023, Advanced Engineering Informatics
Citation Excerpt: To solve the sequence labeling problem, conventional statistical learning methods such as the hidden Markov model (HMM) [26–28] and conditional random field model (CRF) [29,30] are commonly used. In addition, deep learning methods such as the long short-term memory network (LSTM) [31,32] have also been widely applied to solve the sequence labeling problem. Due to the unique design of its gate structures, LSTM has a powerful extraction ability for time series features.
Identification of cyber harassment and intention of target users on social media platforms
2022, Engineering Applications of Artificial Intelligence
Citation Excerpt: According to the experiment’s findings, the CNN model and each feature worked together to get the optimum outcome and produced the best F1 score of 0.817 at p 0.01, which was much higher than that of any other model. AlKhwiter and Al-Twairesh (2021) presented two supervised POS taggers that were created using Bidirectional Long Short-Term Memory (Bi-LSTM) models with conditional random fields. The accuracy of the Bi-LSTM-based POS tagger is 96.5 per cent.
AdaSL: An Unsupervised Domain Adaptation framework for Arabic multi-dialectal Sequence Labeling
2022, Information Processing and Management
Citation Excerpt: The results have shown that training a joint multi-dialectal POS tagging model outperforms its uni-dialectal counterpart. AlKhwiter and Al-Twairesh (2021) have investigated the POS tagging of Arabic tweets using both the CRF and bidirectional Long Short-Term Memory (LSTM) approaches. Their work has focused on MSA and the Gulf dialect.
Normalized Orthography for Tunisian Arabic
2024, arXiv
Deep Learning-based POS Tagger and Chunker for Odia Language Using Pre-trained Transformers
2024, ACM Transactions on Asian and Low-Resource Language Information Processing