Image Captioning with a Joint Attention Mechanism by Visual Concept Samples
ACM Transactions on Multimedia Computing, Communications, and Applications (IF 5.2). Pub Date: 2020-07-06. DOI: 10.1145/3394955
Jin Yuan, Lei Zhang, Songrui Guo, Yi Xiao, Zhiyong Li

The attention mechanism has been established as an effective method for generating caption words in image captioning; it explores one noticed subregion in an image to predict a related caption word. However, even though the attention mechanism can offer accurate subregions to train a model, the learned captioner may still make wrong predictions, especially for visual concept words, which are the most important parts for understanding an image. To tackle this problem, in this article we propose the Visual Concept Enhanced Captioner, which employs a joint attention mechanism with visual concept samples to strengthen the prediction of visual concepts in image captioning. Different from traditional attention approaches, which adopt one LSTM to explore one noticed subregion at a time, the Visual Concept Enhanced Captioner introduces multiple virtual LSTMs in parallel to simultaneously receive multiple subregions from visual concept samples. The model then updates its parameters by jointly exploring these subregions according to a composite loss function. Technically, this joint learning helps to find the common characteristics of a visual concept and thus improves the prediction accuracy for visual concepts. Moreover, by integrating diverse visual concept samples from different domains, our model can be extended to bridge visual bias in cross-domain learning for image captioning, which saves the cost of labeling captions. Extensive experiments have been conducted on two image datasets (MSCOCO and Flickr30K), and superior results are reported compared to state-of-the-art approaches. Notably, our approach significantly increases BLEU-1 and F1 scores, which demonstrates an accuracy improvement for visual concepts in image captioning.
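The abstract describes the mechanism only at a conceptual level. For readers who want a more concrete picture, the sketch below illustrates the general idea rather than the authors' implementation: several parallel branches (standing in for the "virtual" LSTMs) share one decoder cell, each attends to sub-regions of a different visual concept sample, and their per-branch losses are summed into a composite loss that jointly updates the shared parameters. All module names, tensor dimensions, and the choice of a single shared nn.LSTMCell are illustrative assumptions, not details taken from the paper.

# Minimal, illustrative sketch (not the authors' code) of a joint attention
# captioner trained with a composite loss over several visual concept samples.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttentionCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # One shared LSTM cell stands in for the parallel "virtual" LSTMs.
        self.cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.att_v = nn.Linear(feat_dim, hidden_dim)
        self.att_h = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def attend(self, feats, h):
        # feats: (B, R, feat_dim) region features; h: (B, hidden_dim) decoder state.
        scores = self.att_out(torch.tanh(self.att_v(feats) + self.att_h(h).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)          # attention weights over R sub-regions
        return (alpha * feats).sum(dim=1)         # attended context vector

    def step(self, feats, word, state):
        h, c = state
        ctx = self.attend(feats, h)
        h, c = self.cell(torch.cat([self.embed(word), ctx], dim=-1), (h, c))
        return self.fc(h), (h, c)

def composite_loss(model, sample_feats, captions, hidden_dim=512):
    # sample_feats: list of (B, R, feat_dim) tensors, one per visual concept sample;
    # captions: (B, T) token ids for the shared ground-truth caption.
    B = captions.size(0)
    total = 0.0
    for feats in sample_feats:                    # each parallel branch sees one sample
        h = feats.new_zeros(B, hidden_dim)
        c = feats.new_zeros(B, hidden_dim)
        for t in range(captions.size(1) - 1):
            logits, (h, c) = model.step(feats, captions[:, t], (h, c))
            total = total + F.cross_entropy(logits, captions[:, t + 1])
    return total / len(sample_feats)              # one joint update of the shared weights

# Toy usage with random features and captions:
model = JointAttentionCaptioner()
feats = [torch.randn(4, 36, 2048) for _ in range(3)]   # 3 concept samples, 36 regions each
caps = torch.randint(0, 10000, (4, 12))
loss = composite_loss(model, feats, caps)
loss.backward()

As a design note, sharing the decoder weights across the parallel branches is what lets the composite loss pull the model toward features that are common to all samples of a visual concept, which matches the intuition described in the abstract.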
