Knowledge Distillation from Internal Representations
arXiv - CS - Computation and Language. Pub Date: 2019-10-08, DOI: arxiv-1910.03723
Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, Chenlei Guo

Knowledge distillation is typically conducted by training a small model (the student) to mimic a large and cumbersome model (the teacher). The idea is to compress the knowledge from the teacher by using its output probabilities as soft-labels to optimize the student. However, when the teacher is considerably large, there is no guarantee that the internal knowledge of the teacher will be transferred into the student; even if the student closely matches the soft-labels, its internal representations may be considerably different. This internal mismatch can undermine the generalization capabilities originally intended to be transferred from the teacher to the student. In this paper, we propose to distill the internal representations of a large model such as BERT into a simplified version of it. We formulate two ways to distill such representations and various algorithms to conduct the distillation. We experiment with datasets from the GLUE benchmark and consistently show that adding knowledge distillation from internal representations is a more powerful method than only using soft-label distillation.
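To make the idea concrete, below is a minimal sketch of a training loss that combines soft-label distillation with an internal-representation matching term. The layer mapping, the cosine-based hidden-state loss, the loss weights, and the assumption that teacher and student share the same hidden size are illustrative choices for this sketch, not the paper's exact formulation.

```python
# Hedged sketch: soft-label distillation plus an internal-representation term.
# Assumes teacher and student hidden states have the same dimensionality;
# otherwise a learned linear projection on the student side would be needed.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=2.0, alpha=0.5):
    """student_hidden / teacher_hidden: lists of [batch, seq, dim] tensors."""
    # Soft-label term: KL divergence between temperature-scaled distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Internal-representation term: align each student layer with an evenly
    # spaced teacher layer via a cosine-distance penalty (illustrative mapping).
    step = len(teacher_hidden) // len(student_hidden)
    internal_loss = 0.0
    for i, s_h in enumerate(student_hidden):
        t_h = teacher_hidden[(i + 1) * step - 1]
        internal_loss = internal_loss + (1.0 - F.cosine_similarity(s_h, t_h, dim=-1)).mean()
    internal_loss = internal_loss / len(student_hidden)

    # Weighted sum of the soft-label and internal-representation objectives.
    return alpha * soft_loss + (1.0 - alpha) * internal_loss
```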

Updated: 2020-01-17