Dual-path Convolutional Image-Text Embeddings with Instance Loss
ACM Transactions on Multimedia Computing, Communications, and Applications (IF 5.2). Pub Date: 2020-05-25. DOI: 10.1145/3383184
Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, Mingliang Xu, Yi-Dong Shen
Matching images and sentences demands a fine understanding of both modalities. In this article, we propose a new system to discriminatively embed images and text into a shared visual-textual space. In this field, most existing works apply the ranking loss to pull positive image/text pairs close and push negative pairs apart. However, directly deploying the ranking loss on heterogeneous features (i.e., text and image features) is less effective, because it is hard to find appropriate triplets at the beginning of training. Naively applying the ranking loss may therefore prevent the network from learning the inter-modal relationship. To address this problem, we propose the instance loss, which explicitly considers the intra-modal data distribution. It is based on the unsupervised assumption that each image/text group can be viewed as its own class, so the network can learn fine-grained distinctions from every image/text group. Experiments show that the instance loss offers better weight initialization for the ranking loss, so that more discriminative embeddings can be learned. In addition, existing works usually apply off-the-shelf features, e.g., word2vec and fixed visual features. As a minor contribution, this article constructs an end-to-end dual-path convolutional network to learn the image and text representations. End-to-end learning allows the system to learn directly from the data and fully utilize the supervision. On two generic retrieval datasets (Flickr30k and MSCOCO), experiments demonstrate that our method yields competitive accuracy compared to state-of-the-art methods. Moreover, in language-based person retrieval, we improve the state of the art by a large margin. The code has been made publicly available.
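The two losses in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the margin value, classifier weights, and helper names are all illustrative assumptions. The instance loss treats each image/text group as its own class and applies a shared classifier to both modalities; the ranking loss is the standard triplet hinge on a similarity score.

```python
import math

def softmax_cross_entropy(logits, label):
    # Numerically stable cross-entropy for a single example.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[label] / sum(exps))

def instance_loss(img_emb, txt_emb, classifier, label):
    # Instance loss sketch: a classifier shared across modalities maps each
    # embedding to per-group logits; the group index serves as the class label.
    img_logits = [sum(w * e for w, e in zip(row, img_emb)) for row in classifier]
    txt_logits = [sum(w * e for w, e in zip(row, txt_emb)) for row in classifier]
    return (softmax_cross_entropy(img_logits, label)
            + softmax_cross_entropy(txt_logits, label))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def ranking_loss(anchor, positive, negative, margin=0.2):
    # Triplet ranking loss: pull the matched pair together and push the
    # mismatched pair apart by at least `margin` (margin is illustrative).
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))
```

In this toy form, a well-matched triplet incurs zero ranking loss, while the instance loss stays informative even before any good triplets exist, which is the abstract's motivation for using it as initialization.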

Updated: 2020-05-25