Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2021-05-06, DOI: arxiv-2105.02723
Luke Melas-Kyriazi

The strong performance of vision transformers on image classification and other vision tasks is often attributed to the design of their multi-head attention layers. However, the extent to which attention is responsible for this strong performance remains unclear. In this short report, we ask: is the attention layer even necessary? Specifically, we replace the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension. The resulting architecture is simply a series of feed-forward layers applied over the patch and feature dimensions in an alternating fashion. In experiments on ImageNet, this architecture performs surprisingly well: a ViT/DeiT-base-sized model obtains 74.9% top-1 accuracy, compared to 77.9% and 79.9% for ViT and DeiT respectively. These results indicate that aspects of vision transformers other than attention, such as the patch embedding, may be more responsible for their strong performance than previously thought. We hope these results prompt the community to spend more time trying to understand why our current models are as effective as they are.
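To make the architectural change concrete, the following is a minimal PyTorch sketch of one such block, where the self-attention sub-layer is replaced by a feed-forward layer applied over the patch (token) dimension, followed by the usual feed-forward layer over the feature dimension. Module names and hidden-layer sizes are illustrative assumptions, not the author's released implementation.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Standard two-layer MLP used in transformer blocks."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)

class FFOnlyBlock(nn.Module):
    """One block: feed-forward over the patch dimension (in place of attention),
    then feed-forward over the feature dimension, each with a residual connection."""
    def __init__(self, dim, num_patches, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Mixes information across patches; applied to the transposed sequence.
        self.patch_ff = FeedForward(num_patches, num_patches * mlp_ratio)
        self.norm2 = nn.LayerNorm(dim)
        # Mixes information within each patch's feature vector.
        self.feature_ff = FeedForward(dim, dim * mlp_ratio)

    def forward(self, x):  # x: (batch, num_patches, dim)
        # Feed-forward over the patch dimension replaces self-attention.
        x = x + self.patch_ff(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Standard feed-forward over the feature dimension.
        x = x + self.feature_ff(self.norm2(x))
        return x

if __name__ == "__main__":
    x = torch.randn(2, 196, 768)            # 14x14 patches at ViT-base width
    block = FFOnlyBlock(dim=768, num_patches=196)
    print(block(x).shape)                   # torch.Size([2, 196, 768])
```

Stacking such blocks on top of a patch embedding yields the feed-forward-only architecture described in the abstract; a real model would also handle the class token and add a classification head.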

Updated: 2021-05-07