Image understanding via learning weakly-supervised cross-modal semantic translation,Journal of Visual Communication and Image Representation


Image understanding via learning weakly-supervised cross-modal semantic translation
Journal of Visual Communication and Image Representation ( IF 2.6 ) Pub Date : 2020-03-23 , DOI: 10.1016/j.jvcir.2020.102789
Guorong Shen

Fusing cross-modal features is important for image understanding, which aims to describe the objects in an image by optimally combining multiple visual channels. In the literature, multimodal fusion of low-level features has achieved impressive performance. However, the semantic gap remains a major limitation: these methods cannot reflect how humans perceive semantic objects in an image. Supervised learning-based methods require prohibitively expensive manual labeling, which makes them impractical. To alleviate these limitations, we present an image understanding method that learns a weakly supervised cross-modal semantic translation. More specifically, we design a manifold embedding algorithm to automatically translate image-level text semantic labels into several pixel-level image regions. Subsequently, we leverage a three-level spatial pyramid model to extract both local and global features of objects from training images. Afterwards, these cross-modal features are concatenated to form a multiple-feature matrix. The feature matrix can be used to learn a kernel SVM and a ranking SVM for image classification and retrieval, respectively. Comprehensive experiments on image recognition, classification, and retrieval demonstrate the effectiveness of our method.
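
The abstract does not specify the descriptor used inside the three-level spatial pyramid, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each image has already been reduced to a 2-D map of integer labels (for example, quantized local descriptors or region ids), and it assumes "three-level" means the common 1x1, 2x2, and 4x4 grid layout. The names `spatial_pyramid_histogram` and `build_feature_matrix` are hypothetical.

```python
def spatial_pyramid_histogram(label_map, n_bins, levels=3):
    """Concatenate normalized label histograms over 1x1, 2x2, 4x4 grids.

    label_map: list of equal-length rows of integer labels in [0, n_bins).
    This input format is an assumption made for illustration only.
    """
    height = len(label_map)
    width = len(label_map[0])
    features = []
    for level in range(levels):
        cells = 2 ** level  # number of cells per side at this level
        for i in range(cells):
            r0, r1 = i * height // cells, (i + 1) * height // cells
            for j in range(cells):
                c0, c1 = j * width // cells, (j + 1) * width // cells
                hist = [0.0] * n_bins
                for row in label_map[r0:r1]:
                    for label in row[c0:c1]:
                        if not 0 <= label < n_bins:
                            raise ValueError("label outside [0, n_bins)")
                        hist[label] += 1.0
                total = sum(hist)
                if total > 0:
                    # Normalize so cells of different sizes are comparable.
                    hist = [count / total for count in hist]
                features.extend(hist)
    return features


def build_feature_matrix(label_maps, n_bins, levels=3):
    """Stack one pyramid vector per image into a row-per-image matrix."""
    return [spatial_pyramid_histogram(m, n_bins, levels) for m in label_maps]


if __name__ == "__main__":
    toy = [
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [2, 2, 3, 3],
        [2, 2, 3, 3],
    ]
    vector = spatial_pyramid_histogram(toy, n_bins=4)
    print(len(vector))  # (1 + 4 + 16) cells * 4 bins = 84
```

The coarsest level summarizes the whole image (global features) and the finer levels capture where labels occur (local features). A matrix built this way could then be passed to a kernel SVM or a ranking SVM, as the abstract describes.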


Updated: 2020-03-23
