Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation
arXiv - CS - Computation and Language. Pub Date: 2021-06-10, DOI: arxiv-2106.05691
Yuanxin Liu, Fandong Meng, Zheng Lin, Weiping Wang, Jie Zhou

Recently, knowledge distillation (KD) has shown great success in BERT compression. Instead of only learning from the teacher's soft labels as in conventional KD, researchers have found that the rich information contained in the hidden layers of BERT is conducive to the student's performance. To better exploit this hidden knowledge, a common practice is to force the student to deeply mimic the teacher's hidden states of all the tokens in a layer-wise manner. In this paper, however, we observe that although distilling the teacher's hidden state knowledge (HSK) is helpful, the performance gain (marginal utility) diminishes quickly as more HSK is distilled. To understand this effect, we conduct a series of analyses. Specifically, we divide the HSK of BERT into three dimensions, namely depth, length and width. We first investigate a variety of strategies to extract crucial knowledge along each single dimension and then jointly compress the three dimensions. In this way, we show that 1) the student's performance can be improved by extracting and distilling the crucial HSK, and 2) using a tiny fraction of the HSK can achieve the same performance as extensive HSK distillation. Based on the second finding, we further propose an efficient KD paradigm to compress BERT, which does not require loading the teacher during the training of the student. For two kinds of student models and computing devices, the proposed KD paradigm yields a training speedup of 2.7x to 3.4x.
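The second finding points to a teacher-free training setup: the crucial slice of hidden state knowledge (HSK) can be extracted and cached ahead of time, so the teacher does not have to be loaded while the student trains. The sketch below illustrates this general idea with PyTorch and Hugging Face Transformers; it is a minimal sketch, not the paper's exact method, and the layer indices, token count, and width used for the depth/length/width reduction are illustrative assumptions rather than the selection strategies studied in the paper.

```python
# Minimal sketch: cache a reduced slice of the teacher's hidden-state
# knowledge (HSK) once, so the teacher is not needed during student training.
# The specific layers/tokens/width kept below are illustrative assumptions.
import torch
from transformers import BertModel, BertTokenizerFast


@torch.no_grad()
def cache_reduced_hsk(texts, layers=(4, 8, 12), keep_tokens=16, keep_width=256):
    """Run the teacher once and keep only a small HSK slice."""
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    teacher = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
    teacher.eval()

    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden_states = teacher(**enc).hidden_states  # tuple: embedding output + 12 layers

    reduced = []
    for layer in layers:                                # depth: keep a few layers
        h = hidden_states[layer][:, :keep_tokens, :]    # length: keep k token positions
        h = h[:, :, :keep_width]                        # width: keep d hidden units
        reduced.append(h)
    return torch.stack(reduced, dim=1)                  # (batch, n_layers, k, d)


def hsk_distill_loss(student_hsk, cached_teacher_hsk):
    """MSE between the student's (projected) HSK slice and the cached teacher slice."""
    return torch.nn.functional.mse_loss(student_hsk, cached_teacher_hsk)


if __name__ == "__main__":
    hsk = cache_reduced_hsk(["knowledge distillation compresses BERT."])
    torch.save(hsk, "teacher_hsk_cache.pt")  # load this cache during student training
    print(hsk.shape)
```

In this setup the teacher's forward pass happens only once, offline; student training then reads the cached tensors from disk, which is where the reported 2.7x to 3.4x speedup comes from in spirit, although the paper's actual extraction strategies for each dimension differ from the fixed slices used here.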

Updated: 2021-06-11