Multi-level embeddings for processing Arabic social media contents, Computer Speech & Language

Multi-level embeddings for processing Arabic social media contents
Computer Speech & Language (IF 3.1), Pub Date: 2021-05-05, DOI: 10.1016/j.csl.2021.101240
Leila Moudjari, Farah Benamara, Karima Akli-Astouati

Embeddings are very popular representations that allow computing semantic and syntactic similarities between linguistic units from a text co-occurrence matrix. Units can vary from character n-grams to words, and extend to more coarse-grained units such as sentences and documents. Recently, multi-level embeddings combining representations from different units have been proposed as an alternative to single-level embeddings, to account for the internal structure of words (i.e., morphology) and to help systems generalise well over out-of-vocabulary words. These representations, either pre-trained or learned, have been shown to be quite effective, outperforming word-level baselines in several NLP tasks such as machine translation, part-of-speech tagging and named entity recognition. Our aim here is to contribute to this line of research by proposing, for the first time in Arabic NLP, an in-depth study of the impact of various subword configurations, ranging from characters to character n-grams (including words), on social media text classification. We propose several neural architectures to learn character, subword and word embeddings, as well as a combination of these three levels, exploring different composition functions to obtain the final representation of a given text. To evaluate the effectiveness of these representations, we perform extrinsic evaluations on three text classification tasks (sentiment analysis, emotion detection and irony detection) while accounting for different Arabic varieties: Modern Standard Arabic and dialects (Levantine and Maghrebi). For each task, we experiment with well-known dialect-agnostic and dialect-specific datasets, including those recently used in shared tasks, to better compare our results with those reported in previous studies on the same datasets.
The results show that the multi-level embeddings we propose outperform current static and contextualised embeddings, as well as the best-performing state-of-the-art models, in sentiment and emotion detection. In addition, we achieve competitive results in irony detection. Our models are also the most productive across dialects, and we observe that different dialects require different composition configurations. Finally, we show that these performances tend to increase when the multi-level representations are coupled with task-specific features.
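
As a rough illustration of the idea of combining character, subword and word levels through a composition function, the following sketch builds a toy multi-level vector for a single word. It is not the authors' architecture: the names `char_ngrams`, `toy_vector`, `average` and `multi_level`, the dimension `DIM`, the `<`/`>` boundary markers, and the hash-based vectors (a stand-in for learned embedding tables) are all assumptions made for this example. Only concatenation and averaging are shown as composition functions.

```python
import hashlib
from typing import List

DIM = 8  # illustrative embedding size (assumption, not from the paper)


def char_ngrams(word: str, n: int) -> List[str]:
    """Return the character n-grams of `word`, padded with boundary markers."""
    padded = f"<{word}>"  # boundary markers are an assumed convention
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def toy_vector(unit: str, dim: int = DIM) -> List[float]:
    """Deterministic pseudo-embedding from a hash.

    Stand-in for a learned lookup table; the values carry no meaning.
    """
    digest = hashlib.sha256(unit.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def average(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def multi_level(word: str, n: int = 3, compose: str = "concat") -> List[float]:
    """Combine character-, subword- and word-level vectors for one word."""
    char_vec = average([toy_vector(ch) for ch in word])
    subword_vec = average([toy_vector(g) for g in char_ngrams(word, n)])
    word_vec = toy_vector(word)
    levels = [char_vec, subword_vec, word_vec]
    if compose == "concat":
        # Concatenation keeps each level in its own slice of the vector.
        return [value for level in levels for value in level]
    if compose == "mean":
        # Averaging keeps the dimension fixed but mixes the levels.
        return average(levels)
    raise ValueError(f"unknown composition function: {compose}")


if __name__ == "__main__":
    word = "كتاب"  # "book"
    print(char_ngrams(word, 3))
    print(len(multi_level(word, compose="concat")))
    print(len(multi_level(word, compose="mean")))
```

With concatenation the result has three times the per-level dimension, whereas averaging keeps it fixed. A trained model would replace `toy_vector` with learned embedding layers and would also compose word vectors into a representation of the whole text.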

Updated: 2021-05-24
