Learning Two-Branch Neural Networks for Image-Text Matching Tasks, IEEE Transactions on Pattern Analysis and Machine Intelligence



IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 23.6), Pub Date: 2018-01-24, DOI: 10.1109/tpami.2018.2797921
Liwei Wang, Yin Li, Jing Huang, Svetlana Lazebnik

Image-language matching tasks have recently attracted a lot of attention in the computer vision field. These tasks include image-sentence matching, i.e., given an image query, retrieving relevant sentences and vice versa, and region-phrase matching or visual grounding, i.e., matching a phrase to relevant regions. This paper investigates two-branch neural networks for learning the similarity between these two data modalities. We propose two network structures that produce different output representations. The first one, referred to as an embedding network, learns an explicit shared latent embedding space with a maximum-margin ranking loss and novel neighborhood constraints. Compared to standard triplet sampling, we perform improved neighborhood sampling that takes neighborhood information into consideration while constructing mini-batches. The second network structure, referred to as a similarity network, fuses the two branches via element-wise product and is trained with a regression loss to directly predict a similarity score. Extensive experiments show that our networks achieve high accuracy for phrase localization on the Flickr30K Entities dataset and for bi-directional image-sentence retrieval on the Flickr30K and MSCOCO datasets.
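The two output representations described in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the hinge form of the ranking loss, the cosine similarity on L2-normalized vectors, the `margin` value, the single linear scoring layer, and the logistic form of the regression loss are all assumptions, and every function name is hypothetical. The neighborhood constraints and neighborhood sampling are omitted.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.1):
    """Hinge ranking loss over a batch; img_emb[i] and txt_emb[i] are a match.

    Assumed form: every other item in the batch acts as a negative, and
    similarity is the cosine between embeddings in the shared space.
    """
    imgs = [l2_normalize(v) for v in img_emb]
    txts = [l2_normalize(v) for v in txt_emb]
    loss = 0.0
    for i in range(len(imgs)):
        pos = dot(imgs[i], txts[i])
        for j in range(len(imgs)):
            if j == i:
                continue
            # Image i as query: its sentence should outscore sentence j.
            loss += max(0.0, margin - pos + dot(imgs[i], txts[j]))
            # Sentence i as query: its image should outscore image j.
            loss += max(0.0, margin - pos + dot(imgs[j], txts[i]))
    return loss


def similarity_score(img_feat, txt_feat, weights, bias=0.0):
    """Fuse the two branches by element-wise product, then score linearly.

    The single linear layer is a stand-in (assumption) for whatever layers
    follow the fusion step.
    """
    fused = [a * b for a, b in zip(img_feat, txt_feat)]
    return dot(weights, fused) + bias


def logistic_loss(score, label):
    """Logistic regression loss for a pair labelled +1 (match) or -1."""
    return math.log(1.0 + math.exp(-label * score))
```

In this sketch the embedding branch is judged by relative ordering inside a batch, whereas the similarity branch assigns each image-text pair an independent score, which mirrors the distinction the abstract draws between the two network structures.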


Updated: 2019-01-10
