Multi-Head Attention: Collaborate Instead of Concatenate
arXiv - CS - Machine Learning. Pub Date: 2020-06-29, DOI: arxiv-2006.16362
Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi

Attention layers are widely used in natural language processing (NLP) and are beginning to influence computer vision architectures. However, they suffer from over-parameterization: for instance, it has been shown that the majority of attention heads can be pruned without impacting accuracy. This work aims to enhance the current understanding of how multiple heads interact. Motivated by the observation that trained attention heads share common key/query projections, we propose a collaborative multi-head attention layer that enables heads to learn shared projections. Our scheme reduces the computational cost and the number of parameters of an attention layer and can be used as a drop-in replacement in any transformer architecture. For instance, by allowing heads to collaborate on a neural machine translation task, we can reduce the key dimension by a factor of eight without any loss in performance. We also show that a pre-trained multi-head attention layer can be re-parametrized into our collaborative attention layer. Even without retraining, collaborative multi-head attention reduces the size of the key and query projections by half without sacrificing accuracy. Our code is public.
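As a rough illustration of the idea summarized above, the following minimal PyTorch sketch implements a self-attention layer in which all heads share a single key/query projection of adjustable dimension and each head re-weights that shared space with a learned mixing vector. The class and parameter names (CollaborativeSelfAttention, shared_dim, mixing) are illustrative assumptions, not the authors' released implementation.

import math
import torch
import torch.nn as nn


class CollaborativeSelfAttention(nn.Module):
    """Self-attention with key/query projections shared across heads."""

    def __init__(self, embed_dim: int, num_heads: int, shared_dim: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # One query and one key projection for all heads; shared_dim can be
        # much smaller than num_heads * head_dim, shrinking the layer.
        self.q_proj = nn.Linear(embed_dim, shared_dim, bias=False)
        self.k_proj = nn.Linear(embed_dim, shared_dim, bias=False)
        # Per-head mixing vectors over the shared key/query dimensions.
        self.mixing = nn.Parameter(torch.ones(num_heads, shared_dim))
        # Value and output projections stay as in standard multi-head attention.
        self.v_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.scale = 1.0 / math.sqrt(shared_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq, _ = x.shape
        q = self.q_proj(x)                              # (B, T, shared_dim)
        k = self.k_proj(x)                              # (B, T, shared_dim)
        v = self.v_proj(x).view(batch, seq, self.num_heads, self.head_dim)
        # scores[b, h, i, j] = sum_d q[b, i, d] * mixing[h, d] * k[b, j, d]
        scores = torch.einsum("bid,hd,bjd->bhij", q, self.mixing, k) * self.scale
        attn = scores.softmax(dim=-1)                   # (B, H, T, T)
        out = torch.einsum("bhij,bjhd->bihd", attn, v)  # (B, T, H, head_dim)
        return self.out_proj(out.reshape(batch, seq, -1))


# With 8 heads of size 64 in a 512-dimensional model, setting shared_dim=64
# corresponds to the "key dimension reduced by a factor of eight" setting
# mentioned in the abstract (vs. the 8 * 64 = 512 concatenated key dimension).
layer = CollaborativeSelfAttention(embed_dim=512, num_heads=8, shared_dim=64)
y = layer(torch.randn(2, 10, 512))                      # (2, 10, 512)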

Updated: 2020-07-01