Boosting convolutional image captioning with semantic content and visual relationship
Displays ( IF 3.7 ) Pub Date : 2021-08-23 , DOI: 10.1016/j.displa.2021.102069
Cong Bai 1 , Anqi Zheng 1 , Yuan Huang 1 , Xiang Pan 1 , Nan Chen 2

Image captioning aims to generate a natural-language description of an image automatically, an important but challenging task that spans computer vision and natural language processing. The task is dominated by solutions based on Long Short-Term Memory (LSTM). Although much progress has been made with LSTM in recent years, LSTM-based models rely on serialized generation of descriptions, which cannot be parallelized and pays little attention to the hierarchical structure of captions. To address this problem, we propose a framework that uses a CNN-based generation model to produce image captions with the help of conditional generative adversarial training (CGAN). Furthermore, a multi-modal graph convolution network (MGCN) is used to exploit the visual relationships between objects when generating semantically meaningful captions; the scene graph serves as a bridge that connects objects, attributes, and visual-relationship information to produce better captions. Extensive experiments conducted on the MSCOCO dataset show that our method achieves better or comparable scores relative to state-of-the-art methods. Ablation results show that CGAN and MGCN better reflect the visual relationships between objects in an image and thus generate better captions with richer semantic content.
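To make the MGCN idea concrete, the sketch below shows one generic graph-convolution step over scene-graph node features (objects, attributes, and relationships as nodes, with edges linking related nodes). This is a minimal illustration of standard GCN message passing, not the paper's exact formulation; the shapes, normalization, and the toy "man–riding–horse" graph are all assumptions for illustration.

```python
import numpy as np

def graph_conv_step(X, A, W):
    """One graph-convolution step: aggregate neighbor features, project, ReLU.

    X: (N, d_in) node features; A: (N, N) adjacency; W: (d_in, d_out) weights.
    """
    A_hat = A + np.eye(A.shape[0])          # add self-loops so each node keeps its own features
    deg = A_hat.sum(axis=1, keepdims=True)  # node degrees for normalization
    A_norm = A_hat / deg                    # row-normalized neighborhood averaging
    return np.maximum(0.0, A_norm @ X @ W)  # ReLU(A_norm X W)

# Toy scene graph: 3 nodes (e.g. "man", "riding", "horse") linked in a chain.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                 # initial node features
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)      # chain adjacency
W = rng.normal(size=(4, 2))                 # learnable projection (random here)
H = graph_conv_step(X, A, W)                # relation-aware node embeddings
print(H.shape)                              # (3, 2)
```

After such a step, each node's embedding mixes in information from its neighbors, which is how relationship context can flow into the features the caption generator conditions on.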




Updated: 2021-10-02