MAFormer: A transformer network with multi-scale attention fusion for visual recognition
Neurocomputing (IF 6) Pub Date: 2024-05-10, DOI: 10.1016/j.neucom.2024.127828
Huixin Sun, Yunhao Wang, Xiaodi Wang, Bin Zhang, Ying Xin, Baochang Zhang, Xianbin Cao, Errui Ding, Shumin Han

Vision Transformer and its variants have demonstrated great potential in various computer vision tasks. However, conventional vision transformers often focus on global dependency at a coarse level, which makes it hard to learn global relationships and fine-grained representations at the token level. In this paper, we introduce Multi-scale Attention Fusion into the transformer (MAFormer), which explores local aggregation and global feature extraction in a dual-stream framework for visual recognition. We develop a simple but effective module that explores the full potential of transformers for visual representation by learning fine-grained and coarse-grained features at the token level and dynamically fusing them. Our Multi-scale Attention Fusion (MAF) block consists of: i) a local window attention branch that learns short-range interactions within windows, aggregating fine-grained local features; ii) global feature extraction through a novel Global Learning with Down-sampling (GLD) operation that efficiently captures long-range context information across the whole image; and iii) a fusion module that self-explores the integration of both features via attention. MAFormer achieves state-of-the-art results on several common vision tasks. In particular, MAFormer-L reaches 85.9% Top-1 accuracy on ImageNet, surpassing CSWin-B and LV-ViT-L by 1.7% and 0.6%, respectively. On MSCOCO, MAFormer outperforms the prior art CSWin by 1.7% mAP on object detection and 1.4% on instance segmentation at a similar parameter size. With this performance, MAFormer demonstrates its ability to generalize across various visual benchmarks and its promise as a general backbone for different self-supervised pre-training tasks in the future.
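To make the dual-stream design concrete, below is a minimal PyTorch-style sketch of an MAF block as described in the abstract: a local window-attention branch, a Global Learning with Down-sampling (GLD) branch, and an attention-based fusion of the two. The window size, down-sampling ratio, and gating-based fusion are assumptions for illustration, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MAFBlock(nn.Module):
    """Sketch of a Multi-scale Attention Fusion block (hypothetical hyperparameters)."""
    def __init__(self, dim, num_heads=4, window=7, pool=4):
        super().__init__()
        self.window = window
        self.pool = pool
        # i) local window attention: short-range interactions within windows
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # ii) GLD: attend to a down-sampled token map for long-range context
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # iii) fusion: per-token soft weights over the two streams (one possible design)
        self.gate = nn.Linear(2 * dim, 2)

    def forward(self, x):                       # x: (B, H, W, C)
        B, H, W, C = x.shape
        w = self.window
        # --- local branch: split into non-overlapping windows, attend inside each ---
        xw = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        xw = xw.reshape(-1, w * w, C)           # (B * num_windows, w*w, C)
        local, _ = self.local_attn(xw, xw, xw)
        local = local.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        local = local.reshape(B, H * W, C)
        # --- global branch: full-resolution queries, keys/values from a
        # down-sampled feature map (the GLD idea) ---
        q = x.reshape(B, H * W, C)
        kv = F.avg_pool2d(x.permute(0, 3, 1, 2), self.pool)   # (B, C, H/p, W/p)
        kv = kv.flatten(2).transpose(1, 2)                     # (B, HW/p^2, C)
        globl, _ = self.global_attn(q, kv, kv)
        # --- fusion: dynamically weight the fine-grained and coarse-grained streams ---
        a = torch.softmax(self.gate(torch.cat([local, globl], dim=-1)), dim=-1)
        out = a[..., :1] * local + a[..., 1:] * globl
        return out.view(B, H, W, C)

# Usage example with hypothetical sizes: a 56x56 feature map with 96 channels.
x = torch.randn(2, 56, 56, 96)
y = MAFBlock(96, num_heads=4, window=7, pool=4)(x)
print(y.shape)  # torch.Size([2, 56, 56, 96])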

Updated: 2024-05-10