Cross-Modal Attention With Semantic Consistence for Image-Text Matching.
IEEE Transactions on Neural Networks and Learning Systems (IF 10.2). Pub Date: 2020-02-11. DOI: 10.1109/tnnls.2020.2967597
Xing Xu, Tan Wang, Yang Yang, Lin Zuo, Fumin Shen, Heng Tao Shen

The task of image-text matching refers to measuring the visual-semantic similarity between an image and a sentence. Recently, fine-grained matching methods that explore the local alignment between image regions and sentence words have shown advances in inferring the image-text correspondence by aggregating pairwise region-word similarities. However, the local alignment is hard to achieve because some important image regions may be inaccurately detected or even missing. Meanwhile, some words with high-level semantics cannot strictly correspond to a single image region. To tackle these problems, we highlight the importance of exploiting the global semantic consistency between image regions and sentence words as a complement to the local alignment. In this article, we propose a novel hybrid matching approach named Cross-modal Attention with Semantic Consistency (CASC) for image-text matching. The proposed CASC is a joint framework that performs cross-modal attention for local alignment and multilabel prediction for global semantic consistency. It directly extracts semantic labels from the available sentence corpus without additional labeling cost, and these labels further provide a global similarity constraint on the aggregated region-word similarity obtained by the local alignment. Extensive experiments on the Flickr30k and Microsoft COCO (MSCOCO) data sets demonstrate the effectiveness of the proposed CASC in preserving global semantic consistency alongside the local alignment, and further show its superior image-text matching performance compared with more than 15 state-of-the-art methods.
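To make the hybrid scoring concrete, below is a minimal NumPy sketch of the two components the abstract describes: a word-to-region cross-modal attention that aggregates pairwise cosine similarities into a local alignment score, and a global consistency term that compares the multilabel predictions made from each modality. The attention temperature temp, the mean aggregation, the mixing weight alpha, the cosine-based consistency measure, and all feature shapes are illustrative assumptions, not the authors' exact formulation, which is trained end to end with ranking and multilabel losses.

    import numpy as np

    def l2norm(x, axis=-1, eps=1e-8):
        # Normalize vectors to unit length so dot products become cosine similarities.
        return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def local_alignment_score(regions, words, temp=9.0):
        # Cross-modal attention: each word attends over all image regions,
        # then the per-word cosine similarities are aggregated (mean here;
        # the aggregation function is an assumption for illustration).
        V = l2norm(regions)                 # (n_regions, d) region features
        W = l2norm(words)                   # (n_words, d) word features
        sim = W @ V.T                       # (n_words, n_regions) cosines
        attn = softmax(temp * sim, axis=1)  # word-to-region attention weights
        context = attn @ V                  # attended visual context per word
        word_scores = np.sum(l2norm(context) * W, axis=1)
        return float(word_scores.mean())

    def consistency_score(p_img, p_txt):
        # Global constraint: agreement between the multilabel predictions
        # made from the image regions and from the sentence words.
        return float(l2norm(p_img) @ l2norm(p_txt))

    def casc_score(regions, words, p_img, p_txt, alpha=0.7):
        # Hybrid similarity: local alignment plus global semantic consistency.
        return alpha * local_alignment_score(regions, words) + \
               (1.0 - alpha) * consistency_score(p_img, p_txt)

    # Toy usage with random features and label probabilities.
    rng = np.random.default_rng(0)
    regions = rng.normal(size=(36, 256))   # e.g. 36 detected regions
    words = rng.normal(size=(12, 256))     # e.g. a 12-word sentence
    p_img = rng.random(1000)               # label scores, image branch
    p_txt = rng.random(1000)               # label scores, text branch
    print(casc_score(regions, words, p_img, p_txt))

In this sketch, higher values indicate a better image-sentence match, and the consistency term penalizes pairs whose predicted label sets disagree even when individual region-word similarities are high, which is how the global constraint complements the local alignment.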
