A Practical Survey on Faster and Lighter Transformers
ACM Computing Surveys (IF 23.8) | Pub Date: 2023-07-17 | DOI: 10.1145/3586074
Quentin Fournier, Gaétan Marceau Caron, Daniel Aloise

Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model solely based on the attention mechanism that is able to relate any two positions of the input sequence, hence modelling arbitrarily long dependencies. The Transformer has improved the state-of-the-art across numerous sequence modelling tasks. However, its effectiveness comes at the expense of a quadratic computational and memory complexity with respect to the sequence length, hindering its adoption. Fortunately, the deep learning community has always been interested in improving the models’ efficiency, leading to a plethora of solutions such as parameter sharing, pruning, mixed-precision, and knowledge distillation. Recently, researchers have directly addressed the Transformer’s limitation by designing lower-complexity alternatives such as the Longformer, Reformer, Linformer, and Performer. However, due to the wide range of solutions, it has become challenging for researchers and practitioners to determine which methods to apply in practice to meet the desired tradeoff between capacity, computation, and memory. This survey addresses this issue by investigating popular approaches to make Transformers faster and lighter and by providing a comprehensive explanation of the methods’ strengths, limitations, and underlying assumptions.
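To make the quadratic bottleneck mentioned in the abstract concrete, the following minimal NumPy sketch (an illustration, not code from the surveyed paper; the function name and toy dimensions are assumptions) implements vanilla scaled dot-product attention and highlights the n×n score matrix that drives the quadratic time and memory cost in the sequence length n.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Vanilla attention: every query attends to every key.

    Q, K, V: arrays of shape (n, d) for a sequence of length n.
    The intermediate score matrix has shape (n, n), which is the
    source of the Transformer's quadratic cost in n.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # (n, n) -- quadratic in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n, d)

# Doubling n quadruples the size of the score matrix.
n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (1024, 64); the (1024, 1024) scores dominate memory
```

Lower-complexity variants such as the Longformer, Reformer, Linformer, and Performer surveyed in the paper avoid materialising this full n×n matrix, e.g. by sparsifying, hashing, projecting, or approximating the attention weights.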




Updated: 2023-07-17