Multi-Teacher Distillation With Single Model for Neural Machine Translation
IEEE/ACM Transactions on Audio, Speech, and Language Processing (IF 4.1), Pub Date: 2022-02-28, DOI: 10.1109/taslp.2022.3153264
Xiaobo Liang¹, Lijun Wu², Juntao Li¹, Tao Qin², Min Zhang¹, Tie-Yan Liu²

Knowledge distillation (KD) is an effective strategy in neural machine translation (NMT) for improving the performance of a student model. Usually, the teacher guides the student by distilling its soft labels or data knowledge. However, with only one teacher model, both data diversity and teacher knowledge are limited. A natural solution is to adopt multiple randomized teacher models, but a major shortcoming is that model parameters and training costs grow with the number of teachers. In this work, we explore mimicking multi-teacher distillation using the sub-network space and permuted variants of a single teacher model. Specifically, we train one teacher with multiple sub-network extraction paradigms: sub-layer reordering, layer-drop, and dropout variants. In this way, a single teacher model can provide multiple output variants while introducing neither additional parameters nor much extra training cost. Experiments on 8 IWSLT datasets (IWSLT14 En↔De, En↔Es and IWSLT17 En↔Fr, En↔Zh) and the large-scale WMT14 En→De translation task show that our method achieves performance nearly comparable to that of multiple teacher models with different randomized parameters, for both word-level and sequence-level knowledge distillation. Our code is available online at https://github.com/dropreg/RLD.
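
The sub-network extraction idea described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' RLD implementation (see the GitHub link above); ToyEncoder, sample_variant, and multi_teacher_kd_loss are hypothetical names, and the tiny feed-forward "layers" merely stand in for Transformer sub-layers. The sketch assumes one trained teacher yields several virtual teachers by reordering sub-layers, dropping layers, or re-sampling dropout masks, and the student is distilled on their averaged soft labels.

import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """A stack of feed-forward blocks standing in for Transformer layers."""
    def __init__(self, dim=64, num_layers=6, vocab=1000, p_drop=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Dropout(p_drop))
            for _ in range(num_layers)
        )
        self.proj = nn.Linear(dim, vocab)

    def forward(self, tokens, layer_order=None):
        # layer_order selects which layers run, and in what order.
        order = layer_order if layer_order is not None else range(len(self.layers))
        h = self.embed(tokens)
        for i in order:
            h = h + self.layers[i](h)   # residual connection
        return self.proj(h)             # per-token logits

def sample_variant(num_layers, paradigm):
    """Pick a sub-network: reorder sub-layers, drop layers, or keep all
    (relying on freshly sampled dropout masks for diversity)."""
    idx = list(range(num_layers))
    if paradigm == "reorder":
        random.shuffle(idx)
    elif paradigm == "layerdrop":
        idx = sorted(random.sample(idx, k=max(1, num_layers // 2)))
    return idx  # "dropout" paradigm: original order, new dropout masks

def multi_teacher_kd_loss(teacher, student, tokens, gold, T=2.0, n_variants=3):
    teacher.train()                     # keep dropout active in the teacher
    with torch.no_grad():
        probs = []
        for paradigm in ("reorder", "layerdrop", "dropout")[:n_variants]:
            order = sample_variant(len(teacher.layers), paradigm)
            probs.append(F.softmax(teacher(tokens, order) / T, dim=-1))
        avg_probs = torch.stack(probs).mean(dim=0)   # ensemble of variants
    student_logits = student(tokens)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  avg_probs, reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         gold.view(-1))
    return ce + kd                      # word-level KD objective

Usage, for instance: teacher = ToyEncoder(); student = ToyEncoder(num_layers=3); tokens = torch.randint(0, 1000, (8, 20)); gold = torch.randint(0, 1000, (8, 20)); loss = multi_teacher_kd_loss(teacher, student, tokens, gold); loss.backward(). The key point of the design is that every variant reuses the same teacher parameters, so the diversity of a multi-teacher ensemble is approximated without storing or training additional teachers.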

Updated: 2022-02-28