Lexical data augmentation for sentiment analysis, Journal of the Association for Information Science and Technology

Lexical data augmentation for sentiment analysis
Journal of the Association for Information Science and Technology (IF 2.8), Pub Date: 2021-06-17, DOI: 10.1002/asi.24493
Rong Xiang₁, Emmanuele Chersoni₁, Qin Lu₁, Chu‐Ren Huang₁, Wenjie Li₁, Yunfei Long₂

Machine learning methods, especially deep learning models, have achieved impressive performance in various natural language processing tasks, including sentiment analysis. However, deep learning models demand more training data. Data augmentation techniques, which generate new instances by modifying existing data or by drawing on external knowledge bases, are widely used to address the scarcity of annotated data that keeps machine learning techniques from reaching their full potential. This paper presents part-of-speech (POS) focused lexical substitution for data augmentation (PLSDA), which aims to enhance the performance of machine learning algorithms in sentiment analysis. We exploit POS information to identify words to be replaced and investigate different augmentation strategies to find semantically related substitutions when generating new instances. The choice of POS tags and a variety of strategies, such as semantic-based substitution methods and sampling methods, are discussed in detail. Performance evaluation focuses on comparing PLSDA with two previous lexical substitution-based data augmentation methods, one thesaurus-based and the other based on lexicon manipulation. Our approach is tested on five English sentiment analysis benchmarks: SST-2, MR, IMDB, Twitter, and AirRecord. Hyperparameters such as the candidate similarity threshold and the number of newly generated instances are optimized. Results show that six classifiers (SVM, LSTM, BiLSTM-AT, bidirectional encoder representations from transformers [BERT], XLNet, and RoBERTa) trained with PLSDA achieve an accuracy improvement of more than 0.6% over the two previous lexical substitution methods, averaged over the five benchmarks. Introducing a POS constraint and well-designed augmentation strategies can improve the reliability of lexical data augmentation methods. Consequently, PLSDA significantly improves the performance of sentiment analysis algorithms.
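
The general idea of POS-constrained lexical substitution can be illustrated with a minimal Python sketch. This is not the authors' implementation: the POS table `TOY_POS`, the neighbor table `TOY_NEIGHBORS` with its similarity scores, and the function names `candidates` and `augment` are hypothetical stand-ins for a real POS tagger and a real semantic resource, and uniform random sampling is only one possible sampling strategy.

```python
import random

# Hypothetical toy resources (assumptions for illustration only).
# A real system would use a POS tagger and an embedding model or thesaurus.
TOY_POS = {
    "the": "DET", "movie": "NOUN", "was": "VERB",
    "great": "ADJ", "and": "CONJ", "funny": "ADJ",
}
TOY_NEIGHBORS = {
    "great": [("excellent", 0.82), ("good", 0.74), ("large", 0.41)],
    "funny": [("hilarious", 0.79), ("amusing", 0.77), ("odd", 0.38)],
    "movie": [("film", 0.88)],
}


def candidates(word, threshold):
    """Return substitutes whose similarity score meets the threshold."""
    return [w for w, sim in TOY_NEIGHBORS.get(word, []) if sim >= threshold]


def augment(tokens, target_pos, threshold, n_new, seed=0):
    """Generate up to n_new variants, each with one word substituted.

    Only tokens whose POS tag is in target_pos are eligible for replacement.
    """
    rng = random.Random(seed)
    variants = []
    for i, tok in enumerate(tokens):
        if TOY_POS.get(tok) not in target_pos:
            continue  # POS constraint: skip words outside the chosen tags
        for sub in candidates(tok, threshold):
            variants.append(tokens[:i] + [sub] + tokens[i + 1:])
    # Sample without replacement so the new instances are distinct.
    return rng.sample(variants, min(n_new, len(variants)))


if __name__ == "__main__":
    sentence = "the movie was great and funny".split()
    for new_tokens in augment(sentence, {"ADJ"}, threshold=0.7, n_new=3):
        print(" ".join(new_tokens))
```

In this sketch, `threshold` and `n_new` play the role of the two hyperparameters named in the abstract (candidate similarity threshold and number of newly generated instances). Each generated sentence would inherit the sentiment label of its source sentence, which is why low-similarity candidates such as "large" for "great" are filtered out.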

Updated: 2021-06-17
