Zero-shot domain paraphrase with unaligned pre-trained language models
Complex & Intelligent Systems (IF 5.0), Pub Date: 2022-08-01, DOI: 10.1007/s40747-022-00820-8
Zheng Chen, Hu Yuan, Jiankun Ren

Automatic paraphrase generation is an essential task in natural language processing. However, paraphrase corpora are scarce in many languages, such as Chinese, so generating high-quality paraphrases in these languages remains challenging. Domain paraphrasing is even harder, since in-domain paraphrase sentence pairs are more difficult to obtain. In this paper, we propose a novel approach to domain-specific paraphrase generation in a zero-shot fashion. Our approach is based on a sequence-to-sequence architecture: the encoder is a pre-trained multilingual autoencoder model, and the decoder is a pre-trained monolingual autoregressive model. Because the two models are pre-trained separately, they represent the same token differently; we therefore call them unaligned pre-trained language models. We train the sequence-to-sequence model on an English-to-Chinese machine-translation corpus. Then, given a Chinese sentence as input, the model can, surprisingly, generate fluent and diverse Chinese paraphrases. Since the unaligned pre-trained language models have inconsistent representations of Chinese, we believe the paraphrasing is effectively performed as Chinese-to-Chinese translation. In addition, we collect a small-scale English-to-Chinese machine-translation corpus in the domain of computer science. Fine-tuned on this domain-specific corpus, our model shows an excellent capability for domain paraphrasing. Experimental results show that our approach significantly outperforms previous baselines in terms of relevance, fluency, and diversity.
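To make the setup concrete, below is a minimal sketch of how such an unaligned encoder-decoder could be wired up with Hugging Face Transformers. The specific checkpoints (xlm-roberta-base as the multilingual autoencoder, uer/gpt2-chinese-cluecorpussmall as the monolingual Chinese decoder) and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of an "unaligned" pre-trained encoder-decoder: a multilingual
# autoencoder as encoder, a monolingual autoregressive LM as decoder.
# Checkpoints are assumptions; the paper does not name exact models.
import torch
from transformers import AutoTokenizer, EncoderDecoderModel

ENC_NAME = "xlm-roberta-base"                  # multilingual autoencoder (assumed)
DEC_NAME = "uer/gpt2-chinese-cluecorpussmall"  # monolingual Chinese GPT-2 (assumed)

# Each side keeps its own tokenizer: the two models were pre-trained
# separately, so the same token maps to different ids/embeddings ("unaligned").
enc_tok = AutoTokenizer.from_pretrained(ENC_NAME)
dec_tok = AutoTokenizer.from_pretrained(DEC_NAME)

# Glue the two pre-trained models together; the decoder's cross-attention
# layers are newly initialized and learned during MT training.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(ENC_NAME, DEC_NAME)
start_id = dec_tok.cls_token_id if dec_tok.cls_token_id is not None else dec_tok.bos_token_id
model.config.decoder_start_token_id = start_id
if dec_tok.pad_token_id is None:
    dec_tok.pad_token_id = dec_tok.eos_token_id
model.config.pad_token_id = dec_tok.pad_token_id

# --- Training step on one English-to-Chinese translation pair ---
src = enc_tok("Paraphrase generation is an essential NLP task.",
              return_tensors="pt")
tgt = dec_tok("释义生成是自然语言处理的一项基本任务。", return_tensors="pt")
loss = model(input_ids=src.input_ids,
             attention_mask=src.attention_mask,
             labels=tgt.input_ids).loss
loss.backward()  # one gradient step of the En->Zh translation objective

# --- Zero-shot paraphrasing at inference time ---
# Feed a *Chinese* sentence into the multilingual encoder; the decoder,
# trained only to emit Chinese, "translates" it back into Chinese.
zh = enc_tok("自动释义生成是自然语言处理的一项基本任务。", return_tensors="pt")
out = model.generate(input_ids=zh.input_ids,
                     attention_mask=zh.attention_mask,
                     max_new_tokens=64,
                     do_sample=True, top_p=0.9)  # sampling encourages diversity
print(dec_tok.decode(out[0], skip_special_tokens=True))
```

Under this reading, zero-shot paraphrasing falls out of the mismatch the abstract highlights: the encoder's multilingual representation of a Chinese input differs from the decoder's monolingual one, so decoding behaves like translating Chinese into Chinese rather than copying it.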



Updated: 2022-08-02