When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute, arXiv - CS - Computation and Language



arXiv - CS - Computation and Language, Pub Date: 2021-02-24, DOI: arxiv-2102.12459
Tao Lei

Large language models have become increasingly difficult to train because of the required computation time and cost. In this work, we present SRU++, a recurrent unit with optional built-in attention that exhibits state-of-the-art modeling capacity and training efficiency. On standard language modeling benchmarks such as enwik8 and Wiki-103 datasets, our model obtains better perplexity and bits-per-character (bpc) while using 2.5x-10x less training time and cost compared to top-performing Transformer models. Our results reaffirm that attention is not all we need and can be complementary to other sequential modeling modules. Moreover, fast recurrence with little attention can be a leading model architecture.
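
The abstract reports results in perplexity and bits-per-character (bpc). As a minimal sketch of how these two metrics are commonly derived from a model's mean negative log-likelihood, assuming the loss is averaged per token (for perplexity) or per character (for bpc) and measured in nats, the conversion can be written as follows. The function names are hypothetical and are not taken from the paper or its code.

```python
import math


def perplexity(mean_nll_nats: float) -> float:
    """Perplexity from a mean per-token negative log-likelihood in nats."""
    return math.exp(mean_nll_nats)


def bits_per_character(mean_nll_nats: float) -> float:
    """Bits-per-character from a mean per-character NLL in nats.

    Dividing by ln(2) converts nats to bits.
    """
    return mean_nll_nats / math.log(2)


if __name__ == "__main__":
    # Illustrative loss value only; not a result from the paper.
    example_loss = 0.70
    print(f"perplexity: {perplexity(example_loss):.3f}")
    print(f"bpc: {bits_per_character(example_loss):.3f}")
```

Lower is better for both metrics. The loss value in the usage section is a placeholder chosen for illustration.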


Updated: 2021-02-25
