

T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval
arXiv - CS - Multimedia. Pub Date: 2021-04-20, DOI: arxiv-2104.10054
Xiaohan Wang, Linchao Zhu, Yi Yang

Text-video retrieval is a challenging task that aims to find relevant video content given a natural language description. The key to this problem is measuring text-video similarity in a joint embedding space. However, most existing methods consider only the global cross-modal similarity and overlook local details. Some works incorporate local comparisons through cross-modal local matching and reasoning, but these complex operations are computationally expensive. In this paper, we design an efficient global-local alignment method. The multi-modal video sequences and text features are adaptively aggregated with a set of shared semantic centers. The local cross-modal similarities are computed between the video feature and the text feature within the same center. This design enables fine-grained local comparison and reduces the computational cost of the interaction between each text-video pair. Moreover, a global alignment method is proposed to provide a global cross-modal measurement that is complementary to the local perspective. The globally aggregated visual features also provide additional supervision, which is indispensable to the optimization of the learnable semantic centers. We achieve consistent improvements on three standard text-video retrieval benchmarks and outperform the state of the art by a clear margin.
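
The local alignment idea in the abstract can be illustrated with a small sketch. The abstract does not give the aggregation formula, so the code below assumes a VLAD-style soft assignment (softmax over feature-center dot products, followed by a weighted sum of residuals), which is only one plausible reading. The function names `aggregate` and `local_similarity`, the toy feature values, and the choice to average the per-center cosine similarities are all hypothetical and are not taken from the paper. The global alignment branch is not shown.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def aggregate(features, centers):
    """Softly assign each local feature to the shared centers and sum the
    weighted residuals per center (assumed VLAD-style aggregation).
    Returns one L2-normalized vector per center."""
    dim = len(centers[0])
    agg = [[0.0] * dim for _ in centers]
    for x in features:
        weights = softmax([dot(x, c) for c in centers])
        for k, (w, c) in enumerate(zip(weights, centers)):
            for d in range(dim):
                agg[k][d] += w * (x[d] - c[d])
    normalized = []
    for row in agg:
        norm = math.sqrt(dot(row, row)) or 1.0  # guard against a zero vector
        normalized.append([v / norm for v in row])
    return normalized

def local_similarity(video_feats, text_feats, centers):
    """Compare video and text only within the same center, then average."""
    v = aggregate(video_feats, centers)
    t = aggregate(text_feats, centers)
    return sum(dot(vk, tk) for vk, tk in zip(v, t)) / len(centers)

# Toy inputs (made up): 2 shared centers, 3 video features, 2 word features.
centers = [[1.0, 0.0], [0.0, 1.0]]
video = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
text = [[1.0, 0.2], [0.1, 0.7]]
score = local_similarity(video, text, centers)
```

Under these assumptions, each modality is aggregated once against the shared centers, and scoring a text-video pair then costs one dot product per center instead of a comparison between every word and every video segment.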


Updated: 2021-04-21
