Primer: Searching for Efficient Transformers for Language Modeling
arXiv - CS - Computation and Language. Pub Date: 2021-09-17, DOI: arxiv-2109.08668
David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le

Large Transformer models have been central to recent advances in natural language processing. The training and inference costs of these models, however, have grown rapidly and become prohibitively expensive. Here we aim to reduce the costs of Transformers by searching for a more efficient variant. Compared to previous approaches, our search is performed at a lower level, over the primitives that define a Transformer TensorFlow program. We identify an architecture, named Primer, that has a smaller training cost than the original Transformer and other variants for auto-regressive language modeling. Primer's improvements can be mostly attributed to two simple modifications: squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention. Experiments show Primer's gains over Transformer increase as compute scale grows and follow a power law with respect to quality at optimal model sizes. We also verify empirically that Primer can be dropped into different codebases to significantly speed up training without additional tuning. For example, at a 500M parameter size, Primer improves the original T5 architecture on C4 auto-regressive language modeling, reducing the training cost by 4X. Furthermore, the reduced training cost means Primer needs much less compute to reach a target one-shot performance. For instance, in a 1.9B parameter configuration similar to GPT-3 XL, Primer uses 1/3 of the training compute to achieve the same one-shot performance as Transformer. We open source our models and several comparisons in T5 to help with reproducibility.
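The two modifications highlighted in the abstract are simple enough to illustrate directly. Below is a minimal PyTorch sketch (the paper's own implementation is in TensorFlow) of squared ReLU and of self-attention with a depthwise convolution applied after each of the Q, K, and V projections. The kernel size of 3, the causal left-padding, and the names `squared_relu`, `CausalDepthwiseConv`, and `PrimerStyleAttention` are illustrative assumptions, not taken from the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def squared_relu(x: torch.Tensor) -> torch.Tensor:
    """Squared ReLU: relu(x) ** 2, replacing the usual feed-forward activation."""
    return F.relu(x) ** 2


class CausalDepthwiseConv(nn.Module):
    """Depthwise 1D convolution along the sequence axis, left-padded so that
    position t only attends to positions <= t (needed for auto-regressive LM).
    Kernel size 3 is an assumption for this sketch."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(channels, channels, kernel_size, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, channels); Conv1d expects (batch, channels, seq_len)
        x = x.transpose(1, 2)
        x = F.pad(x, (self.kernel_size - 1, 0))  # causal left padding
        x = self.conv(x)
        return x.transpose(1, 2)


class PrimerStyleAttention(nn.Module):
    """Causal self-attention with a depthwise convolution after each of the
    Q, K, and V projections (a sketch of the modification described above)."""

    def __init__(self, d_model: int, n_heads: int, kernel_size: int = 3):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.q_conv = CausalDepthwiseConv(d_model, kernel_size)
        self.k_conv = CausalDepthwiseConv(d_model, kernel_size)
        self.v_conv = CausalDepthwiseConv(d_model, kernel_size)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_conv(self.q_proj(x))
        k = self.k_conv(self.k_proj(x))
        v = self.v_conv(self.v_proj(x))
        # split into heads: (batch, heads, seq, d_head)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(mask, float("-inf"))
        out = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out)
```

Since the convolution is depthwise, it adds only a handful of parameters per channel, which is consistent with the abstract's framing of the changes as cheap modifications rather than a redesign of the block.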

Updated: 2021-09-20