Image Captioning with Semantic-Enhanced Features and Extremely Hard Negative Examples
Neurocomputing (IF 6), Pub Date: 2020-11-01, DOI: 10.1016/j.neucom.2020.06.112
Wenjie Cai, Qiong Liu

Abstract Image captioning is the task of generating natural-language descriptions of images. In existing image captioning models, the generated captions usually lack semantic discriminability. Achieving semantic discriminability is difficult because it requires the model to capture fine-grained differences between images. In this paper, we propose an image captioning framework with semantic-enhanced features and extremely hard negative examples. These two components are combined in a semantic-enhanced module, which consists of an image-text matching sub-network and a feature fusion layer that produces semantic-enhanced features rich in semantic information. Moreover, to improve semantic discriminability, we propose an extremely hard negative mining method that utilizes extremely hard negative examples to improve the latent alignment between visual and language information. Experimental results on MSCOCO and Flickr30K show that our proposed framework and training method simultaneously improve the performance of image-text matching and image captioning, achieving competitive performance against state-of-the-art methods.
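The abstract's hard-negative mining for image-text matching is typically built on a margin-based triplet loss that penalizes only the hardest negative in each batch. The sketch below illustrates that general idea (a max-of-hinges loss over a similarity matrix); the function name and margin value are illustrative, and the authors' exact "extremely hard" mining criterion is not specified in the abstract.

```python
import numpy as np

def hardest_negative_triplet_loss(sim, margin=0.2):
    """Max-of-hinges triplet loss over an image-caption batch (a sketch).

    sim: (N, N) similarity matrix; sim[i, j] scores image i against
         caption j, so the diagonal holds the matched (positive) pairs.
    """
    n = sim.shape[0]
    pos = np.diag(sim)                       # positive pair scores
    neg = np.where(np.eye(n, dtype=bool), -np.inf, sim)  # mask positives

    hardest_caption = neg.max(axis=1)        # worst negative caption per image
    hardest_image = neg.max(axis=0)          # worst negative image per caption

    # hinge: penalize a negative only when it comes within `margin`
    # of the positive score (or exceeds it)
    loss_i2t = np.maximum(0.0, margin + hardest_caption - pos)
    loss_t2i = np.maximum(0.0, margin + hardest_image - pos)
    return float((loss_i2t + loss_t2i).mean())
```

With positives well separated from negatives the loss is zero; a confusable negative (e.g. a caption that scores higher against the wrong image) contributes a positive hinge term, which is what drives the alignment between visual and language features.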
