Hybrid Space Learning for Language-based Video Retrieval, arXiv - CS - Multimedia

arXiv - CS - Multimedia, Pub Date: 2020-09-10, DOI: arxiv-2009.05381
Jianfeng Dong, Xirong Li, Chaoxi Xu, Gang Yang, Xun Wang

This paper attacks the challenging problem of video retrieval by text. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described exclusively in the form of a natural-language sentence, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is crucial. To that end, the two modalities need to be first encoded into real-valued vectors and then projected into a common space. In this paper we achieve this by proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Our novelty is two-fold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding that represents the rich content of both modalities in a coarse-to-fine fashion. Second, different from a conventional common space learning algorithm which is either concept based or latent space based, we introduce hybrid space learning which combines the high performance of the latent space and the good interpretability of the concept space. Dual encoding is conceptually simple, practically effective and end-to-end trained with hybrid space learning. Extensive experiments on four challenging video datasets show the viability of the new method.
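The idea of scoring a query against a video in both a latent space and a concept space can be illustrated with a small sketch. The abstract does not say how the two spaces are scored or combined, so everything here is an assumption made for illustration: cosine similarity in the latent space, a min/max overlap between concept scores, a weighted sum controlled by a hypothetical `alpha`, and the function names themselves. The encoders are left out; the vectors are taken as given.

```python
import math


def cosine(u, v):
    """Cosine similarity between two latent vectors (assumed latent-space score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def concept_overlap(p, q):
    """Min/max overlap between two concept-score vectors (assumed concept-space score)."""
    num = sum(min(a, b) for a, b in zip(p, q))
    den = sum(max(a, b) for a, b in zip(p, q))
    return num / den if den else 0.0


def hybrid_similarity(query, video, alpha=0.5):
    """Weighted sum of the two scores; alpha is a hypothetical mixing weight."""
    latent = cosine(query["latent"], video["latent"])
    concept = concept_overlap(query["concepts"], video["concepts"])
    return alpha * latent + (1 - alpha) * concept


def rank_videos(query, videos, alpha=0.5):
    """Return video names sorted from most to least similar to the query."""
    return sorted(
        videos,
        key=lambda name: hybrid_similarity(query, videos[name], alpha),
        reverse=True,
    )


# Toy data: two-dimensional latent vectors and scores for two concepts.
query = {"latent": [1.0, 0.0], "concepts": [0.9, 0.1]}
videos = {
    "a": {"latent": [1.0, 0.0], "concepts": [0.9, 0.1]},
    "b": {"latent": [0.0, 1.0], "concepts": [0.1, 0.9]},
}
ranking = rank_videos(query, videos)
```

In this sketch the concept scores stay human-readable (each dimension names a concept), which is the interpretability the abstract attributes to the concept space, while the latent score carries the part of the match that has no explicit label.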


Updated: 2020-09-14
