Multimodal Transformer with Multi-View Visual Representation for Image Captioning
IEEE Transactions on Circuits and Systems for Video Technology (IF 8.4). Pub Date: 2020-12-01. DOI: 10.1109/tcsvt.2019.2947482
Jun Yu, Jing Li, Zhou Yu, Qingming Huang

Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models adopt an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption decoder that generates the output caption words from the visual features via an attention mechanism. Despite the success of existing studies, current methods model only the co-attention that characterizes the inter-modal interactions while neglecting the self-attention that characterizes the intra-modal interactions. Inspired by the success of the Transformer model in machine translation, we extend it to a Multimodal Transformer (MT) model for image captioning. Compared to existing image captioning approaches, the MT model simultaneously captures intra- and inter-modal interactions in a unified attention block. Owing to the in-depth modular composition of such attention blocks, the MT model can perform complex multimodal reasoning and output accurate captions. Moreover, to further improve image captioning performance, multi-view visual features are seamlessly introduced into the MT model. We quantitatively and qualitatively evaluate our approach on the benchmark MSCOCO image captioning dataset and conduct extensive ablation studies to investigate the reasons behind its effectiveness. The experimental results show that our method significantly outperforms the previous state-of-the-art methods. With an ensemble of seven models, our solution ranks first on the real-time leaderboard of the MSCOCO image captioning challenge at the time of writing.
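To make the idea of a unified attention block concrete, below is a minimal PyTorch sketch, not the authors' released code: masked self-attention over the partially generated caption models the intra-modal (word-to-word) interactions, and cross-attention over the CNN region features models the inter-modal (word-to-region) interactions. All layer sizes, class names, and tensor shapes are illustrative assumptions.

```python
# Hypothetical sketch of one Multimodal Transformer decoder block (illustrative only).
import torch
import torch.nn as nn

class MultimodalDecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Intra-modal: each caption word attends to previously generated words.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        # Inter-modal: each caption word attends to the region-based visual features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, words, regions, causal_mask):
        # words:   (batch, T, d_model) embedded caption tokens
        # regions: (batch, R, d_model) projected CNN region features
        x, _ = self.self_attn(words, words, words, attn_mask=causal_mask)
        words = self.norm1(words + x)          # intra-modal interactions
        x, _ = self.cross_attn(words, regions, regions)
        words = self.norm2(words + x)          # inter-modal interactions
        return self.norm3(words + self.ffn(words))

# Usage: stack several blocks, then project to the vocabulary at each decoding step.
block = MultimodalDecoderBlock()
T = 12
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # block attention to future words
out = block(torch.randn(2, T, 512), torch.randn(2, 36, 512), mask)
```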

Updated: 2020-12-01