A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing ( IF 5.5 ) Pub Date : 2021-04-05 , DOI: 10.1109/jstars.2021.3070872
Qimin Cheng 1 , Yuzhuo Zhou 2 , Peng Fu 3 , Yuan Xu 4 , Liang Zhang 5

With the rapid growth of multimodal data from the internet and social media, cross-modal retrieval has become an important and valuable task in recent years. The purpose of cross-modal retrieval is to obtain result data in one modality (e.g., images) that are semantically similar to query data in another modality (e.g., text). In the field of remote sensing, despite a great number of existing works on image retrieval, there has been only a small amount of research on cross-modal image-text retrieval, owing to the scarcity of datasets and the complex characteristics of remote sensing imagery. In this article, we introduce a novel cross-modal image-text retrieval network to establish a direct relationship between remote sensing images and their paired text data. Specifically, our framework includes a semantic alignment module that fully explores the latent correspondence between images and text, using attention and gate mechanisms to filter and optimize data features so that more discriminative feature representations can be obtained. Experimental results on four benchmark remote sensing datasets (UCMerced-LandUse-Captions, Sydney-Captions, RSICD, and NWPU-RESISC45-Captions) show that our proposed method outperforms other baselines and achieves state-of-the-art performance on remote sensing image-text retrieval tasks.
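To make the attention-and-gate idea concrete, the following is a minimal, framework-free sketch of how one feature vector (e.g., a text word embedding) can attend over a set of features from the other modality (e.g., image region embeddings) and then be gated element-wise to suppress weakly supported dimensions. The function names, the dot-product attention formulation, and the sigmoid gating form are illustrative assumptions for exposition; they are not the paper's actual implementation.

```python
import math

def dot(u, v):
    # Inner product of two equal-length feature vectors.
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys):
    # Attention pooling: weight each cross-modal feature ("key")
    # by its similarity to the query, then sum.
    weights = softmax([dot(query, k) for k in keys])
    dim = len(keys[0])
    return [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]

def gate(feature, context):
    # Element-wise sigmoid gate (illustrative): dimensions that agree
    # with the context pass through; weakly supported ones are damped.
    return [f * (1.0 / (1.0 + math.exp(-(f * c))))
            for f, c in zip(feature, context)]

# A text query vector attends over two image-region vectors,
# and the attended feature is then gated by the query.
query = [1.0, 0.0]
regions = [[1.0, 0.0], [0.0, 1.0]]
attended = attend(query, regions)
gated = gate(attended, query)
```

Here the region aligned with the query receives the larger attention weight, and the gate further shrinks the dimension the query does not support, which is the filtering effect the abstract attributes to the semantic alignment module.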

Updated: 2021-05-07