Uncertainty-aware Self-training for Text Classification with Few Labels
arXiv - CS - Computation and Language. Pub Date: 2020-06-27, DOI: arxiv-2006.15315
Subhabrata Mukherjee, Ahmed Hassan Awadallah

The recent success of large-scale pre-trained language models crucially hinges on fine-tuning them with large amounts of labeled data for the downstream task, which is typically expensive to acquire. In this work, we study self-training, one of the earliest semi-supervised learning approaches, as a way to reduce the annotation bottleneck by making use of large-scale unlabeled data for the target task. The standard self-training mechanism randomly samples instances from the unlabeled pool to pseudo-label and augment the labeled data. In this work, we propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network, leveraging recent advances in Bayesian deep learning. Specifically, we propose (i) acquisition functions that select instances from the unlabeled pool using Monte Carlo (MC) Dropout, and (ii) a learning mechanism that leverages model confidence for self-training. As an application, we focus on text classification over five benchmark datasets. We show that our methods, leveraging only 20-30 labeled samples per class per task for training and validation, perform within 3% of fully supervised pre-trained language models fine-tuned on thousands of labeled instances, with an aggregate accuracy of 91%, improving over baselines by up to 12%.
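The acquisition functions build on MC Dropout: keeping dropout active at inference time and averaging several stochastic forward passes to estimate predictive uncertainty (Gal & Ghahramani, 2016). As a rough illustration of this building block only, not the authors' implementation, the following sketch assumes a hypothetical PyTorch classifier `model` that returns logits, and picks low-variance unlabeled examples as pseudo-label candidates:

import torch
import torch.nn.functional as F

def mc_dropout_uncertainty(model, inputs, n_samples=10):
    """Predictive mean and variance via MC Dropout: run several
    stochastic forward passes with dropout left on, then aggregate."""
    model.train()  # keep dropout layers active at inference
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(inputs), dim=-1) for _ in range(n_samples)]
        )  # shape: (n_samples, batch, classes)
    mean_probs = probs.mean(dim=0)  # predictive mean per example
    variance = probs.var(dim=0)     # per-class predictive variance
    return mean_probs, variance

def select_for_pseudo_labeling(mean_probs, variance, k):
    """Toy acquisition: keep the k examples with the lowest variance
    on their predicted class, i.e. the most confident predictions."""
    pseudo_labels = mean_probs.argmax(dim=-1)
    class_var = variance.gather(-1, pseudo_labels.unsqueeze(-1)).squeeze(-1)
    chosen = class_var.argsort()[:k]  # indices of k most certain examples
    return chosen, pseudo_labels[chosen]

Selecting the most certain examples is just one plausible acquisition strategy; the paper's actual acquisition functions, and the confidence-weighted learning mechanism they feed into, are specified in the full text.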

Updated: 2020-06-30