avtmNet: Adaptive Visual-Text Merging Network for Image Captioning
Computers & Electrical Engineering (IF 4.0) Pub Date: 2020-06-01, DOI: 10.1016/j.compeleceng.2020.106630
Heng Song, Junwu Zhu, Yi Jiang

Abstract Automatically generating descriptions for images has recently been studied extensively. Various image captioning techniques have been proposed, among which the attention-based encoder-decoder framework has achieved great success. Two different types of attention models are used to generate image captions: visual attention models, which are good at describing details, and text attention models, which are good at comprehensive understanding. To integrate and make full use of both visual and text information and generate more accurate captions, this paper first introduces a visual attention model to produce visual information and a text attention model to produce text information, and then proposes an adaptive visual-text merging network (avtmNet). The merging network effectively fuses the visual and text information and automatically determines the proportion of each used to generate the next caption word. Extensive experiments on the COCO2014 and Flickr30K datasets show the effectiveness and superiority of the proposed approach.
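The abstract does not give the exact form of the merging network, but the described behavior (adaptively weighting a visual context against a text context per decoding step) can be illustrated with a minimal gating sketch. The module name, the scalar sigmoid gate, and the feature dimensions below are assumptions for illustration only, not the authors' architecture.

```python
import torch
import torch.nn as nn

class AdaptiveVisualTextMerge(nn.Module):
    """Hypothetical sketch: a learned gate weighs the visual context
    against the text context when predicting the next caption word."""

    def __init__(self, dim: int):
        super().__init__()
        # Scalar gate computed from the concatenated contexts (assumed form).
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, visual_ctx: torch.Tensor, text_ctx: torch.Tensor) -> torch.Tensor:
        # beta in (0, 1): proportion of visual vs. text information at this step.
        beta = torch.sigmoid(self.gate(torch.cat([visual_ctx, text_ctx], dim=-1)))
        return beta * visual_ctx + (1.0 - beta) * text_ctx

# Toy usage: batch of 2 captions being decoded, feature dimension 512.
merge = AdaptiveVisualTextMerge(512)
v = torch.randn(2, 512)   # output of a visual attention model (assumed shape)
t = torch.randn(2, 512)   # output of a text attention model (assumed shape)
merged = merge(v, t)      # merged context fed to the decoder for the next word
print(merged.shape)       # torch.Size([2, 512])
```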

Updated: 2020-06-01