Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences,arXiv - CS - Computer Vision and Pattern Recognition


Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2020-01-19, DOI: arxiv-2001.06891
Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, Lianli Gao

In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG). Given an untrimmed video and a declarative or interrogative sentence depicting an object, STVG aims to localize the spatio-temporal tube of the queried object. STVG has two challenging settings: (1) we need to localize spatio-temporal object tubes from untrimmed videos, where the object may exist in only a very small segment of the video; (2) we deal with multi-form sentences, including declarative sentences with explicit objects and interrogative sentences with unknown objects. Existing methods cannot tackle the STVG task because of ineffective tube pre-generation and the lack of object relationship modeling. We therefore propose a novel Spatio-Temporal Graph Reasoning Network (STGRN) for this task. First, we build a spatio-temporal region graph to capture the region relationships with temporal object dynamics, which involves the implicit and explicit spatial subgraphs in each frame and the temporal dynamic subgraph across frames. We then incorporate textual clues into the graph and develop multi-step cross-modal graph reasoning. Next, we introduce a spatio-temporal localizer with a dynamic selection method to directly retrieve the spatio-temporal tubes without tube pre-generation. Moreover, we contribute a large-scale video grounding dataset, VidSTG, based on the video relation dataset VidOR. Extensive experiments demonstrate the effectiveness of our method.
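To make the idea of a spatio-temporal region graph more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say how edges are constructed, so everything here is an assumption for illustration: regions are plain bounding boxes, spatial edges connect every pair of regions within a frame, and temporal edges link regions in adjacent frames whose boxes overlap above an IoU threshold. The names `iou`, `build_region_graph`, and `link_iou` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Node = Tuple[int, int]                   # (frame index, region index)
Edge = Tuple[Node, Node]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_region_graph(
    frames: List[List[Box]], link_iou: float = 0.5
) -> Tuple[List[Edge], List[Edge]]:
    """Return (spatial_edges, temporal_edges) over per-frame region boxes.

    Assumption: spatial edges are all within-frame pairs, and temporal
    edges join regions in consecutive frames with IoU >= link_iou.
    """
    spatial: List[Edge] = []
    temporal: List[Edge] = []
    for t, boxes in enumerate(frames):
        # Spatial subgraph: connect every pair of regions in frame t.
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                spatial.append(((t, i), (t, j)))
        # Temporal subgraph: link overlapping regions in frame t and t + 1.
        if t + 1 < len(frames):
            for i, box in enumerate(boxes):
                for j, next_box in enumerate(frames[t + 1]):
                    if iou(box, next_box) >= link_iou:
                        temporal.append(((t, i), (t + 1, j)))
    return spatial, temporal


# Toy input: one region drifts slowly across three frames,
# while two unrelated regions appear in single frames only.
frames = [
    [(0, 0, 10, 10), (20, 20, 30, 30)],
    [(1, 1, 11, 11), (50, 50, 60, 60)],
    [(2, 2, 12, 12)],
]
spatial_edges, temporal_edges = build_region_graph(frames)
```

In this toy input the temporal edges form a chain through the drifting region, which is the kind of cross-frame structure from which a spatio-temporal tube could be read off. A learned model would attach features to nodes and edges rather than rely on a fixed overlap threshold.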


Updated: 2020-03-26
