A novel domain and event adaptive tweet augmentation approach for enhancing the classification of crisis related tweets

https://doi.org/10.1016/j.datak.2021.101913

Abstract

One purpose of detecting crisis related tweets is to single out the tweets that provide information about help needed and offered. Classification of such tweets is difficult because sufficient annotated tweets are not available in those categories. To facilitate such classification, a domain and event adaptive augmentation approach is proposed. The main objective of the research is to enhance the classification of crisis related tweets that have few training samples. The proposed algorithms are designed to integrate the innate domain and event information during the selection of words for augmentation. Components such as the CrisisLex lexicon, Word2Vec embeddings and WordNet are utilized for the proposed augmentation. Experimentation is carried out to substantiate the benefits of augmentation. Results indicate increased performance of the classifier when provided with the expanded dataset that includes the augmented and original tweets. A novel tweet augmentation algorithm can be utilized to combat the overfitting and class imbalance that arise from having few training samples. The advantage of the proposed algorithms is the ability to retain the structure and inherent nature of the tweets during the augmentation.

Introduction

Data augmentation for text is marginal compared with its considerable use in image analytics [1]. The challenge is to identify a transformation of the original text that holds the same meaning and theme. The difficulty of acquiring relevant words escalates with the informal nature of Twitter data. Retaining both the structure of the tweet, in terms of syntax, and its semantics, in terms of theme, is very challenging. The literature proposes a few approaches for augmentation, but there is a deficit of approaches that incorporate domain and event knowledge in the augmentation.

The use case of tweets requesting or offering help is used to demonstrate the augmentation. The proposed algorithms for domain and event based augmentation are employed, and experimentation is carried out to understand the advantages of tweet augmentation. Classification experiments are performed on tweets without and with augmentation. The results indicate enhanced performance with augmented tweets, with the help of the ProMinent Words feature.

The purpose of identifying disaster related tweets is to attend to the needs of the victims and provide the required help accordingly. Victims stranded at crisis sites have to communicate their needs to the outside world.

When a victim does not know whom to contact specifically, it is valuable to have a means of communicating needs: broadcasting to everyone, conveying the requirements to a higher authority, or laying the information out in the open using a related hashtag for the concerned rescue team to notice. Food, emergency medication, shelter, drinking water and transportation are some examples of the large set of help/aid required by people [2]. Another kind of situation is where volunteers offer donations of food, clothing, shelter etc., for people in need. The information has to be communicated to the response teams so that these needs and donations can be matched. If it reaches the victims directly, they may also contact the volunteer and obtain the donations. The challenge in identifying needs and offers from tweets is that this crucial information is hidden in a mixture of chaotic conversations expressing sentiments, official announcements, periodic updates, duplicates and rumors. For the concerned authorities and volunteers to serve immediately, a system for better classification of these tweets is essential [3], [4]. Amidst the noise of unimportant conversational information, extracting the important need and availability tweets is a challenge [4]. Data sparsity, in the form of a small amount of annotated data, combined with the noisy conversational nature of tweets makes it much more difficult to learn from the data [5].

The classification model analyzing the tweets should be able to single out these tweets and mark them as important so that they can be sent to the concerned rescue team. For the model to learn this kind of tweet, there have to be sufficient training samples in the categories of calls for help/rescue and offers of donations. In the Twitter corpora currently available, the number of annotated tweets for these categories is very small. This data sparsity has to be handled efficiently to produce effective results. Data augmentation provides a suitable solution to this problem and helps in enhancing the classification of such categories.

Data augmentation is the process of creating new instances of text data derived from a set of original text data. Variations are introduced into the original data and the results are produced as augmented data. Data augmentation is prevalent in image processing, where many variations of one picture are generated to obtain new augmented pictures [6]. These pictures are fed as inputs to the learning model to help it learn from the variations and gain a good understanding of the image. Data augmentation for text remains limited, although developments have been proposed in recent years [7]. One of the most common approaches to text augmentation is to obtain synonyms of a word from the original text and use them in the derived text [8], [9].
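Synonym-based augmentation of this kind can be illustrated with a minimal sketch. The synonym table, the function name `synonym_augment` and the example tweet are all hypothetical and chosen only for illustration; in practice a thesaurus such as WordNet would take the place of the toy table, and this is not the algorithm proposed in the paper.

```python
# Toy synonym table standing in for a real thesaurus such as WordNet.
# Every entry is a made-up illustration, not data from the paper.
TOY_SYNONYMS = {
    "need": ["require"],
    "food": ["meals"],
    "help": ["assistance"],
}


def synonym_augment(text, synonyms):
    """Return a copy of `text` with each known word replaced by its first synonym."""
    out = []
    for token in text.split():
        candidates = synonyms.get(token.lower())
        # Words without an entry (including hashtags) are kept unchanged.
        out.append(candidates[0] if candidates else token)
    return " ".join(out)


original = "We need food and water near the station"
augmented = synonym_augment(original, TOY_SYNONYMS)
print(augmented)
```

The derived text keeps the word order of the original, so the sentence structure is preserved while the surface wording varies.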

Related work

Data augmentation for text data is limited owing to the difficulty of obtaining suitable augmentation words that increase the training samples while preserving the semantics, so that the new text keeps the same label. Data augmentation helps to increase the size of the dataset and to enhance the robustness of the model [1]. Some approaches are reviewed in this section to gain insight into tweet augmentation and to determine the necessary advancements.

The

Contribution

Data augmentation is employed to handle classification problems with few training samples. This can also be viewed as a class imbalance issue, which can be addressed by increasing the samples in a particular category. Feeding too little training data to learning models may result in overfitting and, eventually, poorer performance on validation data or in real-time use [5].

A domain and event based augmentation approach is proposed to generate three new versions of an original tweet.

Methodology

Preserving the nature of the original tweet in the generated augmented tweets is crucial for attaining good performance. This ensures that the augmented tweets carry the same meaning as the original tweet and help the model learn variations in the data of the same category. Augmented data that differs too much from the original data may mislead the model into learning the category from outliers. It is pertinent to ensure the inclusion of domain and event related information

Experimental details

The aim of tweet augmentation is to enhance the classification of tweets that have fewer training samples. The domain and event based tweet augmentation algorithms introduced in the previous section provide new tweets that can be used to expand the training dataset. Experimentation is carried out to demonstrate the usefulness of the augmentation approach through the improvement in the classification results. The comparison of classifier performance with the original and augmented tweets

Metrics.

A total of 430 tweets from EMTerms and 2238 tweets from NARMADA are acquired from the above-mentioned categories, occupying 142.9 kB. Each tweet from this original dataset is augmented into three versions, generating a total of 7836 augmented tweets. Together with the original tweets, a total of 10448 tweets, with a size of 6.5 MB, are gathered as training samples for the disaster category for the classification experiment.
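The reported counts can be related to one another with a short arithmetic check. The variable names are illustrative, and the only inputs are the figures quoted above. The 7836 augmented tweets and the 10448-tweet total both correspond to 2612 source tweets, which is 56 fewer than the 2668 tweets acquired; this excerpt does not say how that difference arises, so any explanation (for example, tweets dropped before augmentation) would be an assumption.

```python
# Figures quoted in the text above.
emterms_tweets = 430
narmada_tweets = 2238
versions_per_tweet = 3
augmented_tweets = 7836
training_total = 10448

# Tweets acquired from the two sources.
acquired = emterms_tweets + narmada_tweets

# Source tweets implied by the augmented count (three versions each).
augmented_sources = augmented_tweets // versions_per_tweet

# Original tweets implied by the training total.
originals_in_total = training_total - augmented_tweets

# Difference between acquired tweets and those implied by the other counts.
gap = acquired - augmented_sources

print(acquired, augmented_sources, originals_in_total, gap)
```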

The aim of the experiment is to assess and

A tweet augmentation engine.

The contribution presented in this research work is an independent augmentation engine that provides three versions of the input tweet. The proposed algorithms are crafted to furnish three different yet domain adaptive variations of the tweet. In situations where very few tweets are available, the augmentation process can be looped. In that way, the algorithms can render many generated tweets from a minimal number of original tweets. This can be very helpful in training the

Significance

A huge amount of information about crisis situations is shared on social media with the advent of digitization. From the clutter of such noisy data, it is highly difficult to extract and annotate the tweets with crucial information. According to a study conducted by the authors of [27], around 15% of the posts on Twitter during emergency events contain crucial information requiring emergency response, with around 1% of tweets requiring imminent response. With little

Conclusion

Augmentation is performed when there is a scarcity of training samples. The algorithms proposed in this work are efficient in producing three augmented versions of a tweet that align with the domain and event of the original tweet, so that the essence of the original tweet is sustained. An experiment is carried out in which the call for help and rescue category of tweets in the crisis domain is augmented, and the classification results highlight improvement in the metrics. The

CRediT authorship contribution statement

Dharini Ramachandran: Conceptualization, Methodology, Software, Data curation, Writing – original draft, Visualization, Writing – reviewing and editing. Parvathi R.: Investigation, Supervision, Validation, Writing – reviewing and editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Dharini R. is a research scholar pursuing Ph.D. in Vellore Institute of Technology, Chennai, India. She obtained her B.Tech and M.E from Anna University, Chennai, India. Her research interest includes Social Media Text Analytics, Natural Language Processing, Deep Learning, Artificial Intelligence.

References (29)

  • X. Zhang et al., Character-level convolutional networks for text classification
  • W.Y. Wang, D. Yang, That’s so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to...
  • J. Risch, R. Krestel, Aggression identification using deep learning and data augmentation, in: Proceedings of the First...
  • S. Sharifirad, B. Jafarpour, S. Matwin, Boosting text classification performance on sexist tweets by text augmentation...

Parvathi R. is a Professor in Vellore Institute of Technology, Chennai, India. She obtained her Ph.D. from Anna University in the field of spatial data mining. She has research experience in the field of Data mining and Big Data Analytics.
