Decoupled Cross-Modal Phrase-Attention Network for Image-Sentence Matching
IEEE Transactions on Image Processing (IF 10.8), Pub Date: 2022-08-17, DOI: 10.1109/tip.2022.3197972
Zhangxiang Shi, Tianzhu Zhang, Xi Wei, Feng Wu, Yongdong Zhang

The mainstream of image-sentence matching research currently focuses on fine-grained alignment between image regions and sentence words. However, these methods miss a crucial fact: the correspondence between images and sentences does not come simply from alignments between individual regions and words, but from alignments between the phrases they respectively form. In this work, we propose a novel Decoupled Cross-modal Phrase-Attention network (DCPA) for image-sentence matching that models the relationships between textual phrases and visual phrases. Furthermore, we design a novel decoupled scheme for training and inference that relieves the trade-off in bi-directional retrieval: image-to-sentence matching is executed in the textual semantic space, while sentence-to-image matching is executed in the visual semantic space. Extensive experimental results on Flickr30K and MS-COCO demonstrate that the proposed method outperforms state-of-the-art methods by a large margin and is competitive with some methods that introduce external knowledge.
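The following is a minimal sketch of the decoupled bi-directional matching idea described in the abstract, assuming pre-extracted phrase-level features of a shared dimensionality. The dot-product attention, mean pooling, and the names `phrase_attention` and `decoupled_scores` are illustrative assumptions, not the authors' implementation; the point is only that each retrieval direction compares representations that live in a single modality's semantic space.

```python
import torch
import torch.nn.functional as F


def phrase_attention(queries, contexts):
    """Attend each query phrase over the context phrases and return the
    attention-weighted context summaries (one per query)."""
    attn = F.softmax(queries @ contexts.T, dim=-1)   # (n_q, n_c)
    return attn @ contexts                           # (n_q, d)


def decoupled_scores(visual_phrases, textual_phrases):
    """Return (s_i2t, s_t2i) for one image-sentence pair.

    Illustrative only: the image-to-sentence score is computed between
    vectors living in the textual space, the sentence-to-image score
    between vectors living in the visual space.
    """
    v = F.normalize(visual_phrases, dim=-1)   # (n_v, d) visual phrase features
    t = F.normalize(textual_phrases, dim=-1)  # (n_t, d) textual phrase features

    # Image-to-sentence: map each visual phrase into the textual space via
    # attention over textual phrases, then compare with the sentence summary
    # (mean of textual phrases) in that space.
    v_in_text = phrase_attention(v, t)                         # (n_v, d)
    s_i2t = F.cosine_similarity(v_in_text.mean(0), t.mean(0), dim=0)

    # Sentence-to-image: the symmetric computation, carried out in the
    # visual space instead.
    t_in_vis = phrase_attention(t, v)                          # (n_t, d)
    s_t2i = F.cosine_similarity(t_in_vis.mean(0), v.mean(0), dim=0)

    return s_i2t, s_t2i


if __name__ == "__main__":
    # Toy usage with random features: 6 visual phrases, 4 textual phrases, d=256.
    v = torch.randn(6, 256)
    t = torch.randn(4, 256)
    print(decoupled_scores(v, t))
```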

Updated: 2024-08-26