SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries

arXiv - CS - Artificial Intelligence, Pub Date: 2020-11-24, DOI: arxiv-2011.12091
Xirong Li, Fangming Zhou, Chaoxi Xu, Jiaqi Ji, Gang Yang

Retrieving unlabeled videos by textual queries, known as Ad-hoc Video Search (AVS), is a core theme in multimedia data management and retrieval. The success of AVS depends on cross-modal representation learning that encodes both query sentences and videos into common spaces for semantic similarity computation. Inspired by the initial success of a few previous works in combining multiple sentence encoders, this paper takes a step forward by developing a new and general method for effectively exploiting diverse sentence encoders. The novelty of the proposed method, which we term Sentence Encoder Assembly (SEA), is two-fold. First, unlike prior art that uses only a single common space, SEA supports text-video matching in multiple encoder-specific common spaces. This property prevents the matching from being dominated by a specific encoder that produces an encoding vector much longer than those of the other encoders. Second, in order to explore complementarities among the individual common spaces, we propose multi-space multi-loss learning. As extensive experiments on four benchmarks (MSR-VTT, TRECVID AVS 2016-2019, TGIF and MSVD) show, SEA surpasses the state-of-the-art. In addition, SEA is extremely easy to implement. All this makes SEA an appealing solution for AVS and promising for continuously advancing the task by harvesting new sentence encoders.
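The two ideas in the abstract, matching in encoder-specific common spaces and multi-space multi-loss learning, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the random linear projections, the cosine similarity, the summation of per-space scores, the text-to-video hinge ranking loss, and every name below (MultiSpaceMatcher, triplet_ranking_loss, multi_space_loss, the feature dimensions) are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


class MultiSpaceMatcher:
    """One common space per sentence encoder (hypothetical random linear projections)."""

    def __init__(self, text_dims, video_dim, space_dim=8):
        # One text projection and one video projection per encoder-specific space.
        self.text_proj = [rng.normal(size=(d, space_dim)) for d in text_dims]
        self.video_proj = [rng.normal(size=(video_dim, space_dim)) for _ in text_dims]

    def per_space_similarity(self, text_feats, video_feat):
        """Return one (n_queries, n_videos) cosine-similarity matrix per space."""
        sims = []
        for t, wt, wv in zip(text_feats, self.text_proj, self.video_proj):
            q = l2_normalize(t @ wt)           # queries in this space
            v = l2_normalize(video_feat @ wv)  # videos in this space
            sims.append(q @ v.T)
        return sims

    def similarity(self, text_feats, video_feat):
        # Assumed combination rule: sum the per-space similarities, so each
        # encoder contributes a score in [-1, 1] regardless of its vector length.
        return np.sum(self.per_space_similarity(text_feats, video_feat), axis=0)


def triplet_ranking_loss(sim, margin=0.2):
    """Text-to-video hinge ranking loss; the diagonal of sim holds matched pairs."""
    pos = np.diag(sim)[:, None]
    cost = np.maximum(0.0, margin + sim - pos)
    np.fill_diagonal(cost, 0.0)  # a matched pair is not its own negative
    return cost.sum() / sim.shape[0]


def multi_space_loss(sims, margin=0.2):
    """Assumed multi-loss form: one ranking loss per space, added together."""
    return sum(triplet_ranking_loss(s, margin) for s in sims)


# Toy data: 4 matched query/video pairs, a 16-d encoder and a 512-d encoder.
n = 4
text_feats = [rng.normal(size=(n, 16)), rng.normal(size=(n, 512))]
video_feat = rng.normal(size=(n, 32))

model = MultiSpaceMatcher(text_dims=[16, 512], video_dim=32)
sims = model.per_space_similarity(text_feats, video_feat)
score = model.similarity(text_feats, video_feat)
loss = multi_space_loss(sims)

# Inflating the second encoder's output leaves the combined score unchanged.
scaled_feats = [text_feats[0], text_feats[1] * 100.0]
scaled_score = model.similarity(scaled_feats, video_feat)
```

In this sketch each space normalizes its own embeddings before scoring, so neither the 512-dimensional encoder nor a rescaled copy of its output can outweigh the 16-dimensional one. A single shared space built from concatenated encoder outputs would not have that property.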

Updated: 2020-11-25
