Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model
arXiv - CS - Computation and Language. Pub Date: 2020-11-23, DOI: arxiv-2011.11499
Juntao Li, Ruidan He, Hai Ye, Hwee Tou Ng, Lidong Bing, Rui Yan

Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements on various cross-lingual and low-resource tasks. Trained on one hundred languages and terabytes of text, cross-lingual language models have proven effective at leveraging high-resource languages to enhance low-resource language processing, and they outperform monolingual models. In this paper, we further investigate the cross-lingual and cross-domain (CLCD) setting, in which a pretrained cross-lingual language model needs to adapt to new domains. Specifically, we propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features and domain-invariant features from the entangled pretrained cross-lingual representations, given unlabeled raw texts in the source language. Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts. Experimental results show that our proposed method achieves significant performance improvements over the state-of-the-art pretrained cross-lingual language model in the CLCD setting. The source code of this paper is publicly available at https://github.com/lijuntaopku/UFD.
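To make the idea of mutual-information-based feature decomposition concrete, the sketch below shows one plausible way to implement it; it is not the authors' UFD code (see the linked repository for that). It assumes pooled representations from a pretrained cross-lingual encoder such as XLM-R and uses a MINE-style (Donsker-Varadhan) mutual information estimator to discourage overlap between the domain-invariant and domain-specific projections. All module names, dimensions, and the training loop are assumptions made for illustration.

```python
# Minimal illustrative sketch (NOT the authors' UFD implementation): split a
# frozen pretrained cross-lingual representation into domain-invariant and
# domain-specific parts, penalizing the mutual information between them with
# a MINE-style (Donsker-Varadhan) estimator. Names and sizes are hypothetical.
import torch
import torch.nn as nn

class FeatureDecomposer(nn.Module):
    """Two projection heads that split one representation into two subspaces."""
    def __init__(self, in_dim: int, out_dim: int = 256):
        super().__init__()
        self.invariant_head = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))
        self.specific_head = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))

    def forward(self, h: torch.Tensor):
        return self.invariant_head(h), self.specific_head(h)

class MINEEstimator(nn.Module):
    """Donsker-Varadhan lower bound on I(z_inv; z_spec)."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, z_inv: torch.Tensor, z_spec: torch.Tensor):
        batch = z_inv.size(0)
        joint = self.net(torch.cat([z_inv, z_spec], dim=-1)).mean()
        # Shuffling one side approximates sampling from the product of marginals.
        shuffled = z_spec[torch.randperm(batch)]
        marginal = self.net(torch.cat([z_inv, shuffled], dim=-1))
        # E_P[T] - log E_Q[exp(T)]
        return joint - (torch.logsumexp(marginal, dim=0)
                        - torch.log(torch.tensor(float(batch)))).squeeze()

decomposer = FeatureDecomposer(in_dim=768)   # 768 = XLM-R base hidden size
mine = MINEEstimator(feat_dim=256)
opt_dec = torch.optim.Adam(decomposer.parameters(), lr=1e-4)
opt_mi = torch.optim.Adam(mine.parameters(), lr=1e-4)

# Stand-in for pooled encoder outputs of unlabeled source-language texts.
h = torch.randn(32, 768)

# The estimator is trained to maximize the MI bound ...
opt_mi.zero_grad()
(-mine(*decomposer(h))).backward()
opt_mi.step()

# ... while the decomposer is trained to minimize it
# (task and domain losses from the full method are omitted here).
opt_dec.zero_grad()
mine(*decomposer(h)).backward()
opt_dec.step()
```

In the full CLCD setting described above, the domain-invariant part would presumably also feed a task classifier trained on labeled source-language, source-domain data so that it transfers to the new domain; that objective is omitted from this sketch.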

Updated: 2020-11-25