Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval
IEEE Transactions on Image Processing (IF 10.8), Pub Date: 2021-02-09, DOI: 10.1109/tip.2020.3048680
Mengshi Qi, Jie Qin, Yi Yang, Yunhong Wang, Jiebo Luo

With the current exponential growth of video-based social networks, video retrieval using natural language is receiving ever-increasing attention. Most existing approaches tackle this task by extracting individual frame-level spatial features to represent the whole video, while ignoring visual pattern consistencies and intrinsic temporal relationships across different frames. Furthermore, the semantic correspondence between natural language queries and person-centric actions in videos has not been fully explored. To address these problems, we propose a novel binary representation learning framework, named Semantics-aware Spatial-temporal Binaries ($\text{S}^{2}$Bin), which simultaneously considers spatial-temporal context and semantic relationships for cross-modal video retrieval. By exploiting the semantic relationships between two modalities, $\text{S}^{2}$Bin can efficiently and effectively generate binary codes for both videos and texts. In addition, we adopt an iterative optimization scheme to learn deep encoding functions with attribute-guided stochastic training. We evaluate our model on three video datasets and the experimental results demonstrate that $\text{S}^{2}$Bin outperforms the state-of-the-art methods on various cross-modal video retrieval tasks.
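The abstract describes producing binary codes for both videos and texts and then retrieving across modalities. Below is a minimal sketch of that retrieval step only (sign quantization and Hamming-distance ranking over binary codes), assuming hypothetical pre-trained encoders whose outputs are replaced here by random stand-ins; the helper names `to_binary` and `retrieve` are illustrative and not part of the paper's $\text{S}^{2}$Bin framework.

```python
import numpy as np

def to_binary(embedding: np.ndarray) -> np.ndarray:
    """Quantize a real-valued embedding to a {0, 1} binary code via its sign."""
    return (embedding > 0).astype(np.uint8)

def retrieve(query_code: np.ndarray, gallery_codes: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Rank gallery items (e.g. videos) by Hamming distance to a query code (e.g. a text)."""
    dists = np.count_nonzero(gallery_codes != query_code, axis=1)  # per-item bit differences
    return np.argsort(dists)[:top_k]

# Random stand-ins for encoder outputs (64-bit codes); real codes would come
# from the learned video and text encoding functions.
rng = np.random.default_rng(0)
text_code = to_binary(rng.standard_normal(64))
video_codes = to_binary(rng.standard_normal((1000, 64)))
print(retrieve(text_code, video_codes, top_k=5))  # indices of the 5 closest videos
```

Binary codes make large-scale retrieval cheap: Hamming distances reduce to XOR-and-popcount operations, which is the usual motivation for hashing-based cross-modal search.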

Updated: 2021-02-19