GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures
arXiv - CS - Computation and Language. Pub Date: 2021-06-10, DOI: arxiv-2106.05822. Ivan Chelombiev, Daniel Justus, Douglas Orr, Anastasia Dietrich, Frithjof Gressmann, Alexandros Koliousis, Carlo Luschi
Attention-based language models have become a critical component in
state-of-the-art natural language processing systems. However, these models
have significant computational requirements, due to long training times, dense
operations, and large parameter counts. In this work we demonstrate a set of
modifications to the structure of a Transformer layer, producing a more
efficient architecture. First, we add a convolutional module to complement the
self-attention module, decoupling the learning of local and global
interactions. Second, we rely on grouped transformations to reduce the
computational cost of dense feed-forward layers and convolutions while
preserving the expressivity of the model. We apply the resulting architecture
to language representation learning and demonstrate its superior performance
compared to BERT models of different scales. We further highlight its improved
efficiency, both in terms of floating-point operations (FLOPs) and
time-to-train.
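To make the cost saving from grouped transformations concrete, the sketch below compares a dense feed-forward projection with a grouped one, where the input channels are split into independent groups and each group gets its own smaller weight matrix. This is a minimal illustration of the general technique, not the paper's implementation; the hidden sizes (768, 3072) and group count (4) are illustrative assumptions, and `grouped_linear` is a hypothetical helper.

```python
import numpy as np

def dense_ffn_params(d, d_ff):
    # Parameter count of a dense projection d -> d_ff (bias omitted).
    return d * d_ff

def grouped_ffn_params(d, d_ff, groups):
    # Grouped projection: each of `groups` groups maps d/groups -> d_ff/groups,
    # so the parameter count shrinks by a factor of `groups`.
    return groups * (d // groups) * (d_ff // groups)

def grouped_linear(x, weights):
    # x: (batch, d); weights: one (d/g, d_ff/g) matrix per group.
    # Split the channels, project each group independently, then concatenate.
    g = len(weights)
    chunks = np.split(x, g, axis=-1)
    return np.concatenate([c @ w for c, w in zip(chunks, weights)], axis=-1)

# Illustrative BERT-base-like sizes (assumed, not taken from the paper).
d, d_ff, g = 768, 3072, 4
print(dense_ffn_params(d, d_ff))       # 2359296
print(grouped_ffn_params(d, d_ff, g))  # 589824, a 4x reduction
```

Because matrix-multiply FLOPs scale with the parameter count here, the same factor-of-`groups` reduction applies to compute, at the cost of removing cross-group mixing (which a full architecture would restore elsewhere, e.g. with an output projection).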
Updated: 2021-06-11