Constrained LSTM and Residual Attention for Image Captioning, ACM Transactions on Multimedia Computing, Communications, and Applications

Constrained LSTM and Residual Attention for Image Captioning
ACM Transactions on Multimedia Computing, Communications, and Applications ( IF 5.1 ) Pub Date : 2020-07-06 , DOI: 10.1145/3386725
Liang Yang, Haifeng Hu, Songlong Xing, Xinlong Lu

Visual structure and syntactic structure are essential in images and texts, respectively. Visual structure depicts both the entities in an image and their interactions, whereas syntactic structure in text reflects the part-of-speech constraints between adjacent words. Most existing methods either use a global visual representation to guide the language model or generate captions without considering the relationships between different entities or between adjacent words. Thus, their language models lack relevance to both visual and syntactic structure. To solve this problem, we propose a model that aligns the language model with a certain visual structure and also constrains it with a specific part-of-speech template. In addition, most methods exploit the latent relationship between words in a sentence and pre-extracted visual regions in an image, yet ignore the effects of unextracted regions on predicted words. We develop a residual attention mechanism that focuses simultaneously on the pre-extracted visual objects and the unextracted regions in an image. By considering the effects of both visual objects and unextracted regions, residual attention can capture the precise image regions corresponding to the predicted words. The effectiveness of our entire framework and of each proposed module is verified on two classical datasets: MSCOCO and Flickr30k. Our framework is on par with or even better than state-of-the-art methods and achieves superior performance on the COCO captioning leaderboard.
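The abstract does not specify how residual attention is computed, so the following is only a rough illustration of the general idea, not the authors' method. It assumes scaled dot-product soft attention over pre-extracted region features, with a global image feature added back as a residual term to stand in for the unextracted regions. The function names, the feature shapes, and the additive combination are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def residual_attention(query, regions, global_feat):
    """Hypothetical sketch, not the paper's formulation.

    query:       decoder state, list of d floats
    regions:     pre-extracted region features, list of d-float lists
    global_feat: whole-image feature (assumed proxy for unextracted regions)
    """
    scale = math.sqrt(len(query))
    # Scaled dot-product score between the query and each region.
    scores = [sum(q * r for q, r in zip(query, region)) / scale
              for region in regions]
    weights = softmax(scores)
    dim = len(global_feat)
    # Attention-weighted sum of the region features.
    attended = [sum(w * region[d] for w, region in zip(weights, regions))
                for d in range(dim)]
    # Residual step (assumption): add the global feature to the attended one.
    context = [a + g for a, g in zip(attended, global_feat)]
    return context, weights


context, weights = residual_attention(
    query=[1.0, 0.0],
    regions=[[1.0, 0.0], [0.0, 1.0]],
    global_feat=[0.5, 0.5],
)
```

In this toy call the query matches the first region more closely, so that region receives the larger attention weight, while the global feature contributes to the context vector regardless of how the weights are distributed.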

Updated: 2020-07-06
