Similarity Reasoning and Filtration for Image-Text Matching
arXiv - CS - Multimedia. Pub Date: 2021-01-05, DOI: arxiv-2101.01368. Haiwen Diao, Ying Zhang, Lin Ma, Huchuan Lu
Image-text matching plays a critical role in bridging vision and
language, and great progress has been made by exploiting the global alignment
between image and sentence, or local alignments between regions and words.
However, how to make the most of these alignments to infer more accurate
matching scores is still underexplored. In this paper, we propose a novel
Similarity Graph Reasoning and Attention Filtration (SGRAF) network for
image-text matching. Specifically, the vector-based similarity representations
are firstly learned to characterize the local and global alignments in a more
comprehensive manner, and then the Similarity Graph Reasoning (SGR) module
relying on one graph convolutional neural network is introduced to infer
relation-aware similarities with both the local and global alignments. The
Similarity Attention Filtration (SAF) module is further developed to integrate
these alignments effectively by selectively attending to the significant and
representative alignments while discarding the interference of
non-meaningful ones. We demonstrate the superiority of the proposed
method by achieving state-of-the-art performance on the Flickr30K and MSCOCO
datasets, and show the good interpretability of the SGR and SAF modules through
extensive qualitative experiments and analyses.
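The two modules described above can be illustrated with a minimal sketch. This is a hedged toy implementation, not the authors' code: all shapes, weight matrices (`w_edge`, `w_node`, `w_attn`, `v_score`), and function names are assumptions made for illustration. It shows the two ideas in miniature — graph reasoning propagates information between similarity vectors via learned edge weights, and attention filtration aggregates alignments weighted by their estimated significance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n local region-word alignments plus 1 global
# image-sentence alignment, each represented as a d-dimensional
# vector-based similarity representation (shapes are assumptions).
n, d = 5, 8
sim_vectors = rng.standard_normal((n + 1, d))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sgr_step(nodes, w_edge, w_node):
    """One round of similarity graph reasoning: compute edge weights
    from pairwise node affinities, then propagate neighbor information
    to each node to produce relation-aware similarity vectors."""
    affinity = nodes @ w_edge @ nodes.T           # (n+1, n+1) pairwise scores
    adj = softmax(affinity, axis=-1)              # normalized edge weights
    return np.tanh(adj @ nodes @ w_node)          # updated node vectors

def saf_score(nodes, w_attn, v_score):
    """Attention filtration: weight each alignment vector by a learned
    significance score (suppressing non-meaningful alignments),
    aggregate, and map the result to a scalar matching score."""
    attn = softmax((nodes @ w_attn).squeeze(-1))  # significance per alignment
    fused = attn @ nodes                          # filtered aggregate, shape (d,)
    return float(fused @ v_score)                 # scalar image-text score

# Randomly initialized stand-ins for learned parameters.
w_edge = rng.standard_normal((d, d))
w_node = rng.standard_normal((d, d))
w_attn = rng.standard_normal((d, 1))
v_score = rng.standard_normal(d)

reasoned = sgr_step(sim_vectors, w_edge, w_node)
score = saf_score(reasoned, w_attn, v_score)
```

In the paper the two modules are trained jointly and their outputs combined; here they are simply chained to keep the data flow visible.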
Updated: 2021-01-06