When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute
arXiv - CS - Computation and Language. Pub Date: 2021-02-24, DOI: arxiv-2102.12459
Tao Lei

Large language models have become increasingly difficult to train because of the required computation time and cost. In this work, we present SRU++, a recurrent unit with optional built-in attention that exhibits state-of-the-art modeling capacity and training efficiency. On standard language modeling benchmarks such as enwik8 and Wiki-103 datasets, our model obtains better perplexity and bits-per-character (bpc) while using 2.5x-10x less training time and cost compared to top-performing Transformer models. Our results reaffirm that attention is not all we need and can be complementary to other sequential modeling modules. Moreover, fast recurrence with little attention can be a leading model architecture.
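To make the architecture described above concrete, below is a minimal PyTorch sketch of the kind of layer the abstract suggests: a fast elementwise recurrence (as in SRU) whose input projection can optionally be replaced by a small attention block. This is an illustrative assumption based only on the abstract, not the paper's released implementation; the class name SRUppLayerSketch, the single attention head, the dimensions, and the causal mask are all hypothetical choices.

```python
import torch
import torch.nn as nn


class SRUppLayerSketch(nn.Module):
    def __init__(self, d_model: int, d_attn: int = 64, use_attention: bool = True):
        super().__init__()
        self.use_attention = use_attention
        if use_attention:
            # Project down to a small attention dimension, attend, then project
            # back up to the 3 * d_model values the recurrence consumes.
            self.q = nn.Linear(d_model, d_attn, bias=False)
            self.k = nn.Linear(d_attn, d_attn, bias=False)
            self.v = nn.Linear(d_attn, d_attn, bias=False)
            self.out = nn.Linear(d_attn, 3 * d_model, bias=False)
        else:
            # Plain SRU-style layer: a single linear projection of the input.
            self.proj = nn.Linear(d_model, 3 * d_model, bias=False)
        self.bias = nn.Parameter(torch.zeros(2 * d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        if self.use_attention:
            q = self.q(x)                                   # (batch, seq, d_attn)
            k, v = self.k(q), self.v(q)
            scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
            causal = torch.triu(
                torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device), 1
            )
            scores = scores.masked_fill(causal, float("-inf"))
            u = self.out(q + scores.softmax(dim=-1) @ v)    # (batch, seq, 3*d_model)
        else:
            u = self.proj(x)
        z, f_in, r_in = u.chunk(3, dim=-1)
        b_f, b_r = self.bias.chunk(2, dim=-1)
        f = torch.sigmoid(f_in + b_f)                       # forget gate
        r = torch.sigmoid(r_in + b_r)                       # highway (reset) gate
        # Fast elementwise recurrence: c_t = f_t * c_{t-1} + (1 - f_t) * z_t,
        # followed by a highway connection back to the layer input.
        c = torch.zeros_like(x[:, 0])
        outputs = []
        for t in range(x.size(1)):
            c = f[:, t] * c + (1.0 - f[:, t]) * z[:, t]
            outputs.append(r[:, t] * torch.tanh(c) + (1.0 - r[:, t]) * x[:, t])
        return torch.stack(outputs, dim=1)


# Usage: a (batch=2, seq_len=16, d_model=32) input yields an output of the same shape.
layer = SRUppLayerSketch(d_model=32)
y = layer(torch.randn(2, 16, 32))
```

The sketch reflects the trade-off the abstract emphasizes: the attention step is optional and can be used sparingly, while the per-step recurrence is elementwise and cheap, which is consistent with the claim that "fast recurrence with little attention" can reduce training cost.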

Updated: 2021-02-25