Normalized and Geometry-Aware Self-Attention Network for Image Captioning
arXiv - CS - Multimedia. Pub Date: 2020-03-19, DOI: arxiv-2003.08897
Longteng Guo, Jing Liu, Xinxin Zhu, Peng Yao, Shichen Lu, and Hanqing Lu

Self-attention (SA) networks have shown great value in image captioning. In this paper, we improve SA in two respects to boost image captioning performance. First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA. While normalization has previously been applied only outside SA, we introduce a novel normalization method and demonstrate that it is both possible and beneficial to apply it to the hidden activations inside SA. Second, to compensate for a major limitation of the Transformer, namely that it fails to model the geometric structure of the input objects, we propose a class of Geometry-aware Self-Attention (GSA) that extends SA to explicitly and efficiently account for the relative geometric relations between objects in the image. To construct our image captioning model, we combine the two modules and apply them to the vanilla self-attention network. We extensively evaluate our proposals on the MS-COCO image captioning dataset and achieve superior results compared with state-of-the-art approaches. Further experiments on three challenging tasks, i.e., video captioning, machine translation, and visual question answering, demonstrate the generality of our methods.
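
The abstract does not spell out where the normalization sits inside SA or how geometry enters the attention computation. The following PyTorch sketch is one plausible reading, assuming NSA normalizes the query activations inside attention and GSA adds a bias computed from relative bounding-box features to the attention logits; the names NormalizedGeometryAttention and relative_geometry_features are illustrative, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def relative_geometry_features(boxes):
    # boxes: (N, 4) as (cx, cy, w, h); returns (N, N, 4) pairwise relative
    # geometry features (log-scaled offsets and size ratios).
    cx, cy, w, h = boxes.unbind(-1)
    dx = torch.log((cx[:, None] - cx[None, :]).abs().clamp(min=1e-3) / w[:, None])
    dy = torch.log((cy[:, None] - cy[None, :]).abs().clamp(min=1e-3) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)

class NormalizedGeometryAttention(nn.Module):
    # Single-head sketch combining the two ideas: normalization inside SA
    # (here LayerNorm on the queries, an assumption) and a geometry bias
    # added to the attention logits (an assumption about the GSA form).
    def __init__(self, dim, geo_hidden=64):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.q_norm = nn.LayerNorm(dim)          # normalization inside attention
        self.geo_mlp = nn.Sequential(            # relative geometry -> scalar bias
            nn.Linear(4, geo_hidden), nn.ReLU(), nn.Linear(geo_hidden, 1))
        self.scale = dim ** -0.5

    def forward(self, x, boxes):
        # x: (N, dim) region features from an object detector; boxes: (N, 4).
        q = self.q_norm(self.q_proj(x))
        k, v = self.k_proj(x), self.v_proj(x)
        logits = (q @ k.t()) * self.scale                     # content term
        geo = self.geo_mlp(relative_geometry_features(boxes)).squeeze(-1)
        attn = F.softmax(logits + geo, dim=-1)                # geometry-aware weights
        return attn @ v

# Example usage: 10 detected regions with 512-d features and their boxes.
x = torch.randn(10, 512)
boxes = torch.rand(10, 4) + 0.1                               # keep w, h > 0
out = NormalizedGeometryAttention(512)(x, boxes)              # (10, 512)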

Updated: 2020-03-20