Distilling Linguistic Context for Language Model Compression
arXiv - CS - Computation and Language. Pub Date: 2021-09-17, DOI: arxiv-2109.08359
Geondo Park, Gyeongman Kim, Eunho Yang

A computationally expensive and memory-intensive neural network lies behind the recent success of language representation learning. Knowledge distillation, a major technique for deploying such large language models in resource-scarce environments, transfers knowledge learned by the teacher at the level of individual word representations, without further constraints. In this paper, inspired by recent observations that language representations are positioned relative to one another and carry more semantic knowledge as a whole, we present a new knowledge distillation objective for language representation learning that transfers contextual knowledge through two types of relationships across representations: Word Relation and Layer Transforming Relation. Unlike other recent distillation techniques for language models, our contextual distillation places no restrictions on architectural differences between teacher and student. We validate the effectiveness of our method on challenging benchmarks of language understanding tasks, not only across architectures of various sizes but also in combination with DynaBERT, a recently proposed adaptive size pruning method.
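The abstract does not spell out the loss formulation, so the following is only a rough illustrative sketch of what relation-based distillation terms of this kind could look like in PyTorch. The function names, the choice of cosine-similarity relation matrices, the MSE matching, and the assumption that teacher layers have already been mapped to student layers are all hypothetical, not the authors' actual method.

```python
import torch
import torch.nn.functional as F

def word_relation_loss(teacher_hidden, student_hidden):
    """Hypothetical sketch: match pairwise token-to-token relations
    within a layer between teacher and student.

    teacher_hidden, student_hidden: [batch, seq_len, dim] hidden states
    from one teacher layer and one student layer (hidden sizes may differ).
    """
    # Cosine-similarity relation matrices over the sequence dimension.
    t = F.normalize(teacher_hidden, dim=-1)
    s = F.normalize(student_hidden, dim=-1)
    rel_t = torch.bmm(t, t.transpose(1, 2))  # [batch, seq, seq]
    rel_s = torch.bmm(s, s.transpose(1, 2))
    # The relation matrices are free of the hidden dimension, so teacher
    # and student widths need not match.
    return F.mse_loss(rel_s, rel_t)

def layer_transforming_relation_loss(teacher_layers, student_layers):
    """Hypothetical sketch: match how each token's representation changes
    across layers (here, cosine similarity between consecutive layers).

    teacher_layers, student_layers: equal-length lists of
    [batch, seq_len, dim] tensors (a teacher-to-student layer mapping
    is assumed to have been applied beforehand).
    """
    def cross_layer_relation(layers):
        rels = []
        for lo, hi in zip(layers[:-1], layers[1:]):
            rels.append(F.cosine_similarity(lo, hi, dim=-1))  # [batch, seq]
        return torch.stack(rels, dim=0)

    return F.mse_loss(cross_layer_relation(student_layers),
                      cross_layer_relation(teacher_layers))
```

Because both terms compare relation structures rather than raw hidden vectors, a sketch of this form would not require the student to share the teacher's hidden size or layer count, which is consistent with the abstract's claim that no architectural restrictions are imposed.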

Updated: 2021-09-20