Rethinking and Improving Relative Position Encoding for Vision Transformer
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2021-07-29, DOI: arxiv-2107.14222
Kan Wu, Houwen Peng, Minghao Chen, Jianlong Fu, Hongyang Chao

Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens. Its general efficacy has been proven in natural language processing. In computer vision, however, its efficacy is not well studied and even remains controversial, e.g., can relative position encoding work as well as absolute position encoding? To clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in the self-attention mechanism. The proposed iRPE methods are simple and lightweight, and can be easily plugged into transformer blocks. Experiments demonstrate that, solely due to the proposed encoding methods, DeiT and DETR obtain up to 1.5% (top-1 Acc) and 1.3% (mAP) stable improvements over their original versions on ImageNet and COCO respectively, without tuning any extra hyperparameters such as learning rate and weight decay. Our ablations and analysis also yield interesting findings, some of which run counter to previous understanding. Code and models are open-sourced at https://github.com/microsoft/Cream/tree/main/iRPE.
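To make the mechanism concrete, the sketch below shows one way a learnable 2D relative position term can be injected into the attention logits of a vision transformer block: per-axis offsets between query and key patches are clipped into buckets, each bucket owns an embedding, and that embedding interacts with the query (loosely in the spirit of the query-conditioned interaction described in the abstract). This is only a minimal illustration under stated assumptions, not the official iRPE implementation; the class name RelPosSelfAttention, the clipped-offset bucketing, and all shapes are hypothetical choices made for this example.

```python
# Minimal sketch (NOT the official iRPE code) of a 2D relative position term
# added to self-attention logits. Offsets along each axis are clipped into
# buckets; each bucket has a learnable embedding that interacts with the query.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelPosSelfAttention(nn.Module):
    def __init__(self, dim, num_heads=8, height=14, width=14, max_offset=7):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        # One embedding per (clipped dy, clipped dx) bucket, shared across heads.
        num_buckets = (2 * max_offset + 1) ** 2
        self.rel_embed = nn.Parameter(torch.zeros(num_buckets, self.head_dim))
        nn.init.trunc_normal_(self.rel_embed, std=0.02)

        # Precompute the bucket index for every (query, key) patch pair.
        ys, xs = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
        coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)      # (N, 2)
        rel = coords[:, None, :] - coords[None, :, :]                   # (N, N, 2) signed offsets
        rel = rel.clamp(-max_offset, max_offset) + max_offset           # shift to [0, 2*max_offset]
        bucket = rel[..., 0] * (2 * max_offset + 1) + rel[..., 1]       # (N, N) bucket ids
        self.register_buffer("bucket", bucket, persistent=False)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                            # each (B, H, N, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale                   # content term

        # Relative position term: look up the embedding of each pair's offset
        # bucket and dot it with the query at the corresponding position.
        r = self.rel_embed[self.bucket]                                 # (N, N, head_dim)
        attn = attn + torch.einsum("bhid,ijd->bhij", q * self.scale, r)

        attn = F.softmax(attn, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 14 * 14, 192)   # 196 patch tokens, embedding dim 192
    layer = RelPosSelfAttention(dim=192, num_heads=3, height=14, width=14)
    print(layer(x).shape)              # torch.Size([2, 196, 192])
```

The sketch keeps the same interface as a standard multi-head self-attention layer, mirroring the drop-in, plug-into-transformer-blocks usage described in the abstract; the bucketing and directional modeling in the actual repository are richer than the simple clipped-offset scheme used here.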

Updated: 2021-07-30