Towards Making the Most of BERT in Neural Machine Translation
arXiv - CS - Computation and Language. Pub Date: 2019-08-15, DOI: arxiv-1908.05672
Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Yong Yu, Weinan Zhang, Lei Li

GPT-2 and BERT demonstrate the effectiveness of using pre-trained language models (LMs) on various natural language processing tasks. However, LM fine-tuning often suffers from catastrophic forgetting when applied to resource-rich tasks. In this work, we introduce a concerted training framework (Cnmt) that is key to integrating pre-trained LMs into neural machine translation (NMT). Our proposed Cnmt consists of three techniques: a) asymptotic distillation to ensure that the NMT model retains the pre-trained knowledge; b) a dynamic switching gate to avoid catastrophic forgetting of pre-trained knowledge; and c) a strategy to adjust the learning paces according to a scheduled policy. Our machine translation experiments show that Cnmt gains up to 3 BLEU points on the WMT14 English-German language pair, surpassing the previous state-of-the-art pre-training-aided NMT by 1.4 BLEU. On the large WMT14 English-French task with 40 million sentence pairs, our base model still improves significantly upon the state-of-the-art Transformer big model by more than 1 BLEU point.
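To make the three techniques concrete, the sketch below shows one plausible reading of two of them: a dynamic switching gate that mixes BERT representations with NMT encoder states, and an asymptotic distillation term whose weight is annealed by a scheduled policy. This is a minimal illustration in PyTorch, not the authors' implementation; the module names, tensor shapes, and the exact gate/loss formulations are assumptions for the sake of example.

```python
# Minimal sketch (not the paper's code). Assumes BERT outputs `h_bert` and
# NMT encoder states `h_nmt`, both of shape (batch, seq_len, d_model).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicSwitchingGate(nn.Module):
    """Element-wise gate mixing pre-trained BERT features with NMT encoder states."""

    def __init__(self, d_model: int):
        super().__init__()
        self.w_bert = nn.Linear(d_model, d_model)
        self.w_nmt = nn.Linear(d_model, d_model)

    def forward(self, h_bert: torch.Tensor, h_nmt: torch.Tensor) -> torch.Tensor:
        # The gate decides, per position and dimension, how much pre-trained
        # knowledge to keep versus how much task-specific encoding to use.
        g = torch.sigmoid(self.w_bert(h_bert) + self.w_nmt(h_nmt))
        return g * h_bert + (1.0 - g) * h_nmt


def asymptotic_distillation_loss(h_nmt: torch.Tensor,
                                 h_bert: torch.Tensor,
                                 alpha: float) -> torch.Tensor:
    """MSE between NMT encoder states and detached BERT states, scaled by a
    coefficient `alpha` that a scheduled policy anneals toward 0 over training,
    so the distillation constraint relaxes as the NMT model matures."""
    return alpha * F.mse_loss(h_nmt, h_bert.detach())
```

In such a setup, the total training objective would combine the usual translation cross-entropy with the distillation term, and the schedule for `alpha` (technique c) controls how quickly the model is allowed to drift from the pre-trained representations.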

Updated: 2020-03-27