Cost-Sensitive BERT for Generalisable Sentence Classification with Imbalanced Data
arXiv - CS - Computation and Language. Pub Date: 2020-03-16, arXiv:2003.11563
Harish Tayyar Madabushi, Elena Kochkina, Michael Castelle

The automatic identification of propaganda has gained significance in recent years due to technological and social changes in the way news is generated and consumed. That this task can be addressed effectively using BERT, a powerful new architecture which can be fine-tuned for text classification tasks, is not surprising. However, propaganda detection, like other tasks that deal with news documents and other forms of decontextualized social communication (e.g. sentiment analysis), inherently deals with data whose categories are simultaneously imbalanced and dissimilar. We show that BERT, while capable of handling imbalanced classes with no additional data augmentation, does not generalise well when the training and test data are sufficiently dissimilar (as is often the case with news sources, whose topics evolve over time). We show how to address this problem by providing a statistical measure of similarity between datasets and a method of incorporating cost-weighting into BERT when the training and test sets are dissimilar. We test these methods on the Propaganda Techniques Corpus (PTC) and achieve the second-highest score on sentence-level propaganda classification.
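The cost-weighting the abstract refers to is, in its common formulation, a class-weighted cross-entropy loss applied during fine-tuning. The paper's exact scheme is not given here, so the sketch below is a minimal illustration assuming inverse-frequency weights and a standard Hugging Face BERT classifier; the model name, class counts, and helper function are illustrative, not the authors' code.

```python
# Minimal sketch of cost-weighted BERT fine-tuning for binary sentence
# classification. Model name, class counts, and weighting scheme are
# assumptions for illustration, not the paper's exact configuration.
import torch
from torch.nn import CrossEntropyLoss
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Hypothetical imbalanced corpus: far more non-propaganda than propaganda.
class_counts = torch.tensor([9000.0, 1000.0])
# Inverse-frequency weights: the minority class gets the larger weight.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
loss_fn = CrossEntropyLoss(weight=class_weights)

def training_step(texts, labels):
    """One forward pass with cost-weighted cross-entropy."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    logits = model(**enc).logits
    # Errors on the minority (propaganda) class now cost ~9x more, pushing
    # the classifier away from always predicting the majority class.
    return loss_fn(logits, torch.tensor(labels))
```

The abstract also mentions a statistical measure of similarity between training and test sets without naming it. As a plausible stand-in (an assumption, not the paper's measure), the sketch below compares the unigram distributions of two corpora with Jensen-Shannon divergence, which is 0 for identical distributions and grows as the datasets diverge.

```python
# Hypothetical dataset-similarity measure: Jensen-Shannon divergence between
# unigram distributions. A stand-in illustration only; the paper's actual
# statistic is not specified in the abstract.
from collections import Counter
import math

def unigram_dist(sentences):
    """Relative frequency of each whitespace-delimited token."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    """Symmetric, bounded divergence between two unigram distributions."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    def kl(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

train = unigram_dist(["the senator praised the bill",
                      "voters fear the worst"])
test = unigram_dist(["the senator attacked the press"])
print(f"JS divergence: {js_divergence(train, test):.3f}")  # higher = more dissimilar
```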

Updated: 2020-03-27