Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals
IEEE Transactions on Neural Networks and Learning Systems (IF 10.4). Pub Date: 2020-06-05. DOI: 10.1109/tnnls.2020.2997020
Lu Jin, Zechao Li, Jinhui Tang

Hashing has been widely applied to multimodal retrieval on large-scale multimedia data due to its efficiency in computation and storage. In this article, we propose a novel deep semantic multimodal hashing network (DSMHN) for scalable image-text and video-text retrieval. The proposed deep hashing framework leverages a 2-D convolutional neural network (CNN) as the backbone to capture spatial information for image-text retrieval, and a 3-D CNN as the backbone to capture both spatial and temporal information for video-text retrieval. In the DSMHN, two sets of modality-specific hash functions are learned jointly by explicitly preserving both intermodality similarities and intramodality semantic labels. Specifically, under the assumption that the learned hash codes should be optimal for the classification task, two stream networks are jointly trained to learn the hash functions by embedding the semantic labels in the resultant hash codes. Moreover, a unified deep multimodal hashing framework is proposed to learn compact and high-quality hash codes by simultaneously exploiting feature representation learning, intermodality similarity-preserving learning, semantic label-preserving learning, and hash function learning with different types of loss functions. The proposed DSMHN method is a generic and scalable deep hashing framework for both image-text and video-text retrieval, and it can be flexibly integrated with different types of loss functions. We conduct extensive experiments on both single-modal and cross-modal retrieval tasks on four widely used multimodal retrieval data sets. Experimental results on both image-text and video-text retrieval tasks demonstrate that the DSMHN significantly outperforms state-of-the-art methods.
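To make the described joint objective concrete, the following is a minimal PyTorch sketch of a two-stream hashing head, not the authors' implementation. The class and function names (HashHead, joint_loss), the single-label assumption, the weights alpha and beta, and the specific loss forms (a pairwise inner-product likelihood for intermodality similarity, per-modality classification to embed semantic labels in the codes, and a quantization penalty) are all illustrative assumptions, since the abstract does not give the exact formulations.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HashHead(nn.Module):
    """Hypothetical modality-specific hash head: backbone features -> K-bit
    relaxed codes, with a classifier on the codes, reflecting the abstract's
    assumption that good hash codes should be optimal for classification."""

    def __init__(self, feat_dim: int, n_bits: int, n_classes: int):
        super().__init__()
        self.hash_layer = nn.Linear(feat_dim, n_bits)
        self.classifier = nn.Linear(n_bits, n_classes)

    def forward(self, feats: torch.Tensor):
        codes = torch.tanh(self.hash_layer(feats))  # relaxed codes in (-1, 1)
        logits = self.classifier(codes)
        return codes, logits

def joint_loss(h_x, h_y, logits_x, logits_y, labels, sim,
               alpha: float = 1.0, beta: float = 0.1):
    """Illustrative joint objective. sim[i, j] = 1 if sample i of one modality
    and sample j of the other share a semantic label, else 0; labels are
    single-label class indices (a simplifying assumption)."""
    inner = h_x @ h_y.t() / 2.0
    # Negative log-likelihood of pairwise similarities, as used in standard
    # deep cross-modal hashing formulations.
    sim_loss = (F.softplus(inner) - sim * inner).mean()
    # Intramodality semantic label preservation via classification on codes.
    cls_loss = F.cross_entropy(logits_x, labels) + F.cross_entropy(logits_y, labels)
    # Quantization penalty pushing relaxed codes toward binary {-1, +1}.
    quant_loss = (h_x.abs() - 1).pow(2).mean() + (h_y.abs() - 1).pow(2).mean()
    return sim_loss + alpha * cls_loss + beta * quant_loss

At retrieval time, the relaxed codes would be binarized with sign() and ranked by Hamming distance, which is what makes hashing efficient in computation and storage.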

Updated: 2020-06-05