Attention-guided image captioning with adaptive global and local feature fusion
Journal of Visual Communication and Image Representation (IF 2.6), Pub Date: 2021-06-01, DOI: 10.1016/j.jvcir.2021.103138
Xian Zhong, Guozhang Nie, Wenxin Huang, Wenxuan Liu, Bo Ma, Chia-Wen Lin

Although attention mechanisms are widely exploited in encoder-decoder image captioning frameworks, the relation between the selection of salient image regions and the supervision that spatial information provides for local and global representation learning has been overlooked, which degrades captioning performance. We therefore propose an image captioning scheme based on adaptive spatial information attention (ASIA), which extracts a sequence of spatial information for salient objects in a local image region or the entire image. Specifically, in the encoding stage, we extract object-level visual features of salient objects together with their spatial bounding boxes, and we obtain global feature maps of the entire image; the global and local features are fused, and the fused features are fed into an LSTM-based language decoder. In the decoding stage, our adaptive attention mechanism dynamically selects the image regions corresponding to the words of the generated description. Extensive experiments on two datasets demonstrate the effectiveness of the proposed method.
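The fusion and adaptive-attention step described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the weight matrices (`W_l`, `W_g`, `W_h`) and projection vectors (`w_a`, `w_b`) are hypothetical stand-ins for learned parameters, and a scalar gate `beta` adaptively mixes the attended local (object-level) context with the global image feature at each decoding step.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_fusion(local_feats, global_feat, hidden, W_l, W_g, W_h, w_a, w_b):
    """Attend over local object features conditioned on the decoder hidden
    state, then gate between the attended local context and the global
    image feature. All parameters here are hypothetical; this is a sketch
    of the general adaptive global-local fusion idea, not ASIA itself."""
    # attention scores over the k local regions, conditioned on hidden state
    scores = np.tanh(local_feats @ W_l + hidden @ W_h) @ w_a   # shape (k,)
    alpha = softmax(scores)                                    # attention weights
    local_ctx = alpha @ local_feats                            # weighted sum, (d,)
    # scalar gate in (0, 1): how much global context to mix in at this step
    beta = 1.0 / (1.0 + np.exp(-(np.tanh(hidden @ W_h + global_feat @ W_g) @ w_b)))
    context = beta * global_feat + (1.0 - beta) * local_ctx    # fused context, (d,)
    return context, alpha, beta

# Toy usage: k = 5 detected objects, feature dimension d = 8
rng = np.random.default_rng(0)
k, d = 5, 8
local = rng.normal(size=(k, d))     # object-level features
g = rng.normal(size=d)              # global image feature
h = rng.normal(size=d)              # decoder hidden state
W_l, W_g, W_h = (rng.normal(size=(d, d)) for _ in range(3))
w_a, w_b = rng.normal(size=d), rng.normal(size=d)
context, alpha, beta = adaptive_fusion(local, g, h, W_l, W_g, W_h, w_a, w_b)
```

In a full model the fused `context` would be concatenated with the word embedding as input to the LSTM decoder, and `alpha` would shift across regions as different words of the caption are generated.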



Updated: 2021-06-18