Image captioning using DenseNet network and adaptive attention
Signal Processing: Image Communication ( IF 3.4 ) Pub Date : 2020-03-19 , DOI: 10.1016/j.image.2020.115836
Zhenrong Deng , Zhouqin Jiang , Rushi Lan , Wenming Huang , Xiaonan Luo

In image captioning, it is difficult to correctly extract the global features of an image. At the same time, most attention methods force every word to correspond to an image region, ignoring the fact that words such as "the" in the description text have no corresponding image region. To address these problems, this paper proposes an adaptive attention model with a visual sentinel. In the encoding phase, the model uses DenseNet to extract the global features of the image. At each time step, a sentinel gate set by the adaptive attention mechanism decides whether to use the image feature information for word generation. In the decoding phase, a long short-term memory (LSTM) network is applied as the language generation model to improve the quality of the generated captions. Experiments on the Flickr30k and COCO datasets indicate that the proposed model achieves significant improvement in terms of the BLEU and METEOR evaluation criteria.
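The sentinel-gate mechanism described above follows the adaptive attention formulation of Lu et al. (2017): a gate computed from the LSTM input and previous hidden state produces a "visual sentinel" from the memory cell, and the attention distribution over image regions is extended with one extra slot for that sentinel, whose weight β decides how much the model falls back on the language model instead of the visual features. A minimal NumPy sketch of one decoding step is given below; all weight names (`Wx`, `Wh`, `Wv`, `Wg`, `Ws`, `wa`) and shapes are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_attention_step(V, h, m, x, h_prev, params):
    """One decoding step of adaptive attention with a visual sentinel
    (illustrative sketch, after Lu et al. 2017).

    V      : (k, d) image region features from the CNN encoder
    h      : (d,)   current LSTM hidden state
    m      : (d,)   current LSTM memory cell
    x      : (e,)   current word-embedding input
    h_prev : (d,)   previous LSTM hidden state
    Returns the adaptive context vector c_hat and the sentinel weight beta.
    """
    Wx, Wh, Wv, Wg, Ws, wa = (params[k] for k in ("Wx", "Wh", "Wv", "Wg", "Ws", "wa"))
    # Sentinel gate: how much should this word rely on the language model?
    g = sigmoid(Wx @ x + Wh @ h_prev)
    s = g * np.tanh(m)                           # visual sentinel
    # Spatial attention scores over the k image regions
    z = np.tanh(V @ Wv.T + (Wg @ h)) @ wa        # (k,)
    # One extra score for the sentinel, appended to the region scores
    z_s = np.tanh(Ws @ s + Wg @ h) @ wa          # scalar
    alpha = softmax(np.concatenate([z, [z_s]]))  # (k+1,) distribution
    beta = alpha[-1]                             # weight given to the sentinel
    c = alpha[:-1] @ V                           # visual context vector
    c_hat = beta * s + (1.0 - beta) * c          # adaptive context
    return c_hat, beta
```

When β is close to 1 the next word (e.g. "the") is generated mostly from the language model state; when β is close to 0 it is grounded in the attended image regions.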




Updated: 2020-03-19