GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures
arXiv - CS - Computation and Language. Pub Date: 2021-06-10, DOI: arxiv-2106.05822. Ivan Chelombiev, Daniel Justus, Douglas Orr, Anastasia Dietrich, Frithjof Gressmann, Alexandros Koliousis, Carlo Luschi
Attention-based language models have become a critical component in
state-of-the-art natural language processing systems. However, these models
have significant computational requirements, due to long training times, dense
operations, and large parameter counts. In this work we demonstrate a set of
modifications to the structure of a Transformer layer, producing a more
efficient architecture. First, we add a convolutional module to complement the
self-attention module, decoupling the learning of local and global
interactions. Second, we rely on grouped transformations to reduce the
computational cost of dense feed-forward layers and convolutions while
preserving the expressivity of the model. We apply the resulting architecture
to language representation learning and demonstrate its superior performance
compared to BERT models of different scales. We further highlight its improved
efficiency, both in terms of floating-point operations (FLOPs) and
time-to-train.
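To make the cost saving from grouped transformations concrete, the sketch below compares a dense feed-forward projection with a grouped one, where the input channels are split into independent groups and each group gets its own smaller weight matrix. This is a minimal illustration of the general technique, not the paper's implementation; the hidden sizes (768, 3072) and group count (4) are illustrative assumptions, and `grouped_linear` is a hypothetical helper.

```python
import numpy as np

def dense_ffn_params(d, d_ff):
    # Parameter count of a dense projection d -> d_ff (bias omitted).
    return d * d_ff

def grouped_ffn_params(d, d_ff, groups):
    # Grouped projection: each of `groups` groups maps d/groups -> d_ff/groups,
    # so the parameter count shrinks by a factor of `groups`.
    return groups * (d // groups) * (d_ff // groups)

def grouped_linear(x, weights):
    # x: (batch, d); weights: one (d/g, d_ff/g) matrix per group.
    # Split the channels, project each group independently, then concatenate.
    g = len(weights)
    chunks = np.split(x, g, axis=-1)
    return np.concatenate([c @ w for c, w in zip(chunks, weights)], axis=-1)

# Illustrative BERT-base-like sizes (assumed, not taken from the paper).
d, d_ff, g = 768, 3072, 4
print(dense_ffn_params(d, d_ff))       # 2359296
print(grouped_ffn_params(d, d_ff, g))  # 589824, a 4x reduction
```

Because matrix-multiply FLOPs scale with the parameter count here, the same factor-of-`groups` reduction applies to compute, at the cost of removing cross-group mixing (which a full architecture would restore elsewhere, e.g. with an output projection).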
Updated: 2021-06-11