Self-Training Pre-Trained Language Models for Zero- and Few-Shot Multi-Dialectal Arabic Sequence Labeling
arXiv - CS - Neural and Evolutionary Computing. Pub Date: 2021-01-12, DOI: arxiv-2101.04758
Muhammad Khalifa, Muhammad Abdul-Mageed, Khaled Shaalan

A sufficient amount of annotated data is required to fine-tune pre-trained language models for downstream tasks. Unfortunately, obtaining labeled data can be costly, especially across multiple language varieties and dialects. We propose to self-train pre-trained language models in zero- and few-shot scenarios to improve performance on data-scarce dialects using only resources from data-rich ones. We demonstrate the utility of our approach in the context of Arabic sequence labeling by using a language model fine-tuned on Modern Standard Arabic (MSA) only to predict named entity (NE) and part-of-speech (POS) tags on several Dialectal Arabic (DA) varieties. We show that self-training is indeed powerful, improving zero-shot MSA-to-DA transfer by as much as ~10% F$_1$ (NER) and 2% accuracy (POS tagging). We acquire even better performance in few-shot scenarios with limited labeled data. We conduct an ablation experiment and show that the observed performance boost results directly from the unlabeled DA examples used for self-training, which opens up opportunities for developing DA models that exploit only MSA resources. Our approach can also be extended to other languages and tasks.
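The abstract describes a standard self-training (pseudo-labeling) loop: a model fine-tuned on labeled MSA data labels unlabeled DA sentences, and confident predictions are added back to the training set. Below is a minimal sketch of that loop, assuming a generic token-classification model interface; all names here (Example, fine_tune, predict, the confidence threshold) are hypothetical placeholders for illustration, not the authors' code or any specific library API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Example:
    tokens: list[str]
    labels: Optional[list[str]] = None  # None for unlabeled DA sentences

def self_train(
    fine_tune: Callable[[list[Example]], None],              # updates the model in place
    predict: Callable[[Example], tuple[list[str], float]],   # returns (labels, confidence)
    msa_labeled: list[Example],
    da_unlabeled: list[Example],
    rounds: int = 3,
    confidence_threshold: float = 0.9,
) -> None:
    """Iteratively pseudo-label unlabeled DA data with a model trained on MSA."""
    train_set = list(msa_labeled)
    pool = list(da_unlabeled)
    for _ in range(rounds):
        fine_tune(train_set)                      # (re)train on the current labeled set
        kept, remaining = [], []
        for ex in pool:
            pred_labels, conf = predict(ex)
            if conf >= confidence_threshold:      # keep only confident pseudo-labels
                kept.append(Example(ex.tokens, pred_labels))
            else:
                remaining.append(ex)
        if not kept:                              # nothing confident enough; stop early
            break
        train_set.extend(kept)                    # grow the training set with pseudo-labels
        pool = remaining
```

The confidence threshold and number of rounds are illustrative knobs; in practice the zero-shot variant uses no DA gold labels at all, while the few-shot variant would seed `train_set` with a small labeled DA sample in addition to MSA data.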

Updated: 2021-01-14