Cross-modal retrieval with dual multi-angle self-attention
Journal of the Association for Information Science and Technology (IF 2.8), Pub Date: 2020-07-16, DOI: 10.1002/asi.24373
Wenjie Li¹, Yi Zheng¹, Yuejie Zhang¹, Rui Feng¹, Tao Zhang², Weiguo Fan³

In recent years, cross-modal retrieval has been a popular research topic in both computer vision and natural language processing. Because of their heterogeneous properties, a large semantic gap separates different modalities, and establishing correlations between data from different modalities remains a major challenge. In this work, we propose a novel end-to-end framework named Dual Multi-Angle Self-Attention (DMASA) for cross-modal retrieval. Multiple self-attention mechanisms are applied to extract fine-grained features for both images and texts from different angles. We then integrate the coarse-grained and fine-grained features into a multimodal embedding space, in which the similarity between images and texts can be compared directly. Moreover, we propose a multistage training strategy in which each stage provides a good initialization for the next, allowing the framework to perform better. Very promising results compared with state-of-the-art methods are achieved on three benchmark datasets: Flickr8k, Flickr30k, and MSCOCO.
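The abstract describes a common dual-encoder pattern: a self-attention encoder per modality whose coarse-grained (pooled) and fine-grained (attended) features are projected into one shared embedding space, where image-text similarity reduces to a cosine comparison trained with a ranking loss. Below is a minimal PyTorch-style sketch of that general pattern only, not the authors' actual DMASA implementation; all names (SelfAttentionEncoder, contrastive_loss) and dimensions are hypothetical:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionEncoder(nn.Module):
    """Per-modality encoder: self-attention over local features (image
    regions or word embeddings), fused with a coarse pooled summary."""
    def __init__(self, feat_dim, embed_dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, feats):                         # feats: (batch, seq, feat_dim)
        attended, _ = self.attn(feats, feats, feats)  # fine-grained intra-modal attention
        fine = attended.mean(dim=1)                   # pool the attended features
        coarse = feats.mean(dim=1)                    # plain average as a coarse summary
        return F.normalize(self.proj(fine + coarse), dim=-1)  # shared embedding space

def contrastive_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional triplet-style ranking loss over cosine similarities."""
    sims = img_emb @ txt_emb.t()                      # (batch, batch) similarity matrix
    pos = sims.diag().unsqueeze(1)                    # similarities of matched pairs
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    cost_txt = (margin + sims - pos).clamp(min=0).masked_fill(mask, 0)      # image -> text
    cost_img = (margin + sims - pos.t()).clamp(min=0).masked_fill(mask, 0)  # text -> image
    return cost_txt.mean() + cost_img.mean()

# Usage with random stand-ins for precomputed region/word features:
img_enc = SelfAttentionEncoder(feat_dim=512, embed_dim=256)
txt_enc = SelfAttentionEncoder(feat_dim=512, embed_dim=256)
loss = contrastive_loss(img_enc(torch.randn(8, 36, 512)),
                        txt_enc(torch.randn(8, 20, 512)))

The multistage training strategy the abstract mentions would wrap the optimization loop, with each stage initialized from the previous stage's weights; that schedule is omitted here.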
