The synergy of double attention: Combine sentence-level and word-level attention for image captioning
Computer Vision and Image Understanding ( IF 4.5 ) Pub Date : 2020-08-22 , DOI: 10.1016/j.cviu.2020.103068
Haiyang Wei , Zhixin Li , Canlong Zhang , Huifang Ma

Existing attention models for image captioning typically extract only word-level attention information: the attention mechanism extracts local information from the image to generate the current word, without accurate guidance from global image information. In this paper, we first propose an image captioning approach based on self-attention. Sentence-level attention information is extracted from the image through a self-attention mechanism to represent the global image information needed to generate sentences. We then propose a double attention model that combines the sentence-level attention model with the word-level attention model to generate more accurate captions. We apply supervision and optimization at the intermediate stage of the model to mitigate information interference. In addition, we perform two-stage training with reinforcement learning to directly optimize the evaluation metrics. Finally, we evaluate our model on three standard datasets: Flickr8k, Flickr30k, and MSCOCO. Experimental results show that our double attention model generates more accurate and richer captions, and outperforms many state-of-the-art image captioning approaches on various evaluation metrics.
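The core idea can be sketched in a few lines: self-attention over image region features yields a pooled "sentence-level" global context, while a decoder-state-conditioned attention yields the "word-level" local context, and the two are fused. The following is a minimal NumPy sketch of that combination; the function names, dimensions (36 regions of size 512), dot-product scoring, and concatenation fusion are illustrative assumptions, not the paper's actual architecture or parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sentence_level_attention(regions):
    # regions: (k, d) image region features
    # scaled dot-product self-attention among regions, then mean-pool
    scores = regions @ regions.T / np.sqrt(regions.shape[1])   # (k, k)
    attended = softmax(scores) @ regions                       # (k, d)
    return attended.mean(axis=0)                               # (d,) global context

def word_level_attention(regions, h):
    # h: (d,) decoder hidden state at the current word step
    weights = softmax(regions @ h)     # (k,) attention over regions
    return weights @ regions           # (d,) local context for this word

rng = np.random.default_rng(0)
regions = rng.standard_normal((36, 512))   # hypothetical region features
h = rng.standard_normal(512)               # hypothetical decoder state

global_ctx = sentence_level_attention(regions)       # sentence-level
local_ctx = word_level_attention(regions, h)         # word-level
fused = np.concatenate([global_ctx, local_ctx])      # double-attention context
print(fused.shape)  # (1024,)
```

In the paper the two attention branches are learned modules inside an encoder–decoder captioner; this sketch only shows how a global self-attended context and a per-word attended context coexist and are combined at each decoding step.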




Updated: 2020-08-27