MAVA: Multi-level Adaptive Visual-textual Alignment by Cross-media Bi-attention Mechanism.
IEEE Transactions on Image Processing ( IF 10.8 ) Pub Date : 2019-11-22 , DOI: 10.1109/tip.2019.2952085
Yuxin Peng , Jinwei Qi , Yunkan Zhuo

The rapid development of information technology has led to fast growth of visual and textual content, which poses great challenges in building correlations and performing cross-media retrieval between images and sentences. Existing methods mainly explore cross-media correlation either from global-level instances, i.e., whole images and sentences, or from local-level fine-grained patches, i.e., discriminative image regions and key words, and thus ignore the complementary information conveyed by the relations between local-level fine-grained patches. Relation understanding is highly important for learning cross-media correlation: people focus not only on the alignment between discriminative image regions and key words, but also on their relations within the visual and textual context. Therefore, in this paper, we propose the Multi-level Adaptive Visual-textual Alignment (MAVA) approach with the following contributions. First, we propose a cross-media multi-pathway fine-grained network to extract not only local fine-grained patches, i.e., discriminative image regions and key words, but also visual relations between image regions as well as textual relations from the context of sentences, which contain complementary information for exploiting fine-grained characteristics within different media types. Second, we propose a visual-textual bi-attention mechanism to distinguish fine-grained information of different saliency at both the local and relation levels, which provides more discriminative hints for correlation learning. Third, we propose cross-media multi-level adaptive alignment to explore global, local and relation alignments. An adaptive alignment strategy is further proposed to enhance the matched pairs across media types and to discard misalignments adaptively, so as to learn more precise cross-media correlation. Extensive experiments on image-sentence matching are conducted on two widely used cross-media datasets, namely Flickr-30K and MS-COCO, comparing with 10 state-of-the-art methods, which fully verify the effectiveness of the proposed MAVA approach.
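To make the two core ideas in the abstract concrete, the following is a minimal NumPy sketch of (a) a cross-media bi-attention step between image-region and word features, and (b) an adaptive alignment score that keeps well-matched region-word pairs and discards weak ones. All function names, feature dimensions, and the thresholding rule are illustrative assumptions for exposition, not the authors' exact MAVA formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bi_attention(regions, words):
    """Attend image regions over words and words over regions.

    regions: (R, d) region features; words: (W, d) word features.
    Returns a textual context vector per region and a visual context
    vector per word, weighted by cross-media saliency.
    """
    sim = regions @ words.T                    # (R, W) cross-media affinities
    attn_r2w = softmax(sim, axis=1)            # each region attends to words
    attn_w2r = softmax(sim.T, axis=1)          # each word attends to regions
    words_for_regions = attn_r2w @ words       # (R, d) text context per region
    regions_for_words = attn_w2r @ regions     # (W, d) visual context per word
    return words_for_regions, regions_for_words

def adaptive_alignment_score(regions, words, margin=0.0):
    """Aggregate local similarities while discarding misalignments.

    Keeps only regions whose best-matching word exceeds `margin` in cosine
    similarity, mimicking "enhance matched pairs, discard misalignments".
    """
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    sim = r @ w.T                              # (R, W) cosine similarities
    best = sim.max(axis=1)                     # best-matching word per region
    kept = best[best > margin]                 # drop poorly aligned regions
    return kept.mean() if kept.size else 0.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    regions = rng.normal(size=(36, 256))       # e.g. 36 detected image regions
    words = rng.normal(size=(12, 256))         # e.g. a 12-word sentence
    ctx_text, ctx_visual = bi_attention(regions, words)
    print(ctx_text.shape, ctx_visual.shape)    # (36, 256) (12, 256)
    print(adaptive_alignment_score(regions, words))
```

In this sketch the same pairwise similarity matrix drives attention in both directions, which is the sense in which the mechanism is "bi-attention"; the adaptive score then aggregates only the pairs that survive the threshold, rather than averaging over every region-word pair.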

Updated: 2020-04-22