An empirical evaluation of text representation schemes to filter the social media stream
Journal of Experimental & Theoretical Artificial Intelligence ( IF 1.7 ) Pub Date : 2021-04-24 , DOI: 10.1080/0952813x.2021.1907792
Sandip Modha 1, 2 , Prasenjit Majumder 1 , Thomas Mandl 1, 3

ABSTRACT

Modeling text in a numerical representation is a prerequisite for any downstream Natural Language Processing task such as text classification. This paper studies the effectiveness of text representation schemes on a text classification task: aggressive text detection, a special case of hate speech detection on social media. Aggression levels are categorised into three predefined classes: ‘Non-aggressive’ (NAG), ‘Overtly Aggressive’ (OAG), and ‘Covertly Aggressive’ (CAG). Text representation schemes based on BoW techniques, word embeddings, contextual word embeddings, and sentence embeddings are compared on both traditional classifiers and deep neural models, with the weighted F1 score as the primary evaluation metric. On the English dataset, text representation using Google’s universal sentence encoder (USE) outperforms word embedding and BoW techniques on traditional classifiers such as SVM, while pre-trained word embedding models perform better with deep neural classifiers. Recent pre-trained transfer learning models such as ELMo, ULMFiT, and BERT are fine-tuned for the aggression classification task; however, their results are not on par with the pre-trained word embedding models. Overall, word embeddings from pre-trained fastText vectors produce a better weighted F1 score than Word2Vec and GloVe. On the Hindi dataset, BoW techniques perform better than word embeddings on traditional classifiers such as SVM, whereas pre-trained word embedding models perform better on deep neural classifiers. Statistical significance tests confirm the significance of the classification results. Deep neural models are more robust against bias induced by the training dataset and perform substantially better than traditional classifiers such as SVM, logistic regression, and Naive Bayes on the Twitter test dataset.
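The weighted F1 score used as the primary metric averages per-class F1 scores weighted by each class's support (its share of the true labels), which matters here because the three aggression classes are typically imbalanced. A minimal sketch of the metric over the paper's NAG/OAG/CAG labels (the toy label lists are illustrative, not data from the paper):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted F1: per-class F1 averaged with weights equal to each
    class's share of the true labels (its support)."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (support[c] / total) * f1
    return score

# Toy predictions over the three aggression classes.
y_true = ["NAG", "NAG", "OAG", "CAG", "CAG", "NAG"]
y_pred = ["NAG", "OAG", "OAG", "CAG", "NAG", "NAG"]
print(round(weighted_f1(y_true, y_pred), 3))  # → 0.667
```

This matches `sklearn.metrics.f1_score(..., average='weighted')`, the form commonly reported for imbalanced multi-class tasks such as this one.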



