Spatial–Temporal Relation Reasoning for Action Prediction in Videos
International Journal of Computer Vision (IF 11.6), Pub Date: 2021-02-12, DOI: 10.1007/s11263-020-01409-9
Xinxiao Wu, Ruiqi Wang, Jingyi Hou, Hanxi Lin, Jiebo Luo

Action prediction in videos refers to inferring the action category label from an early observation of a video. Existing studies mainly focus on exploiting multiple visual cues to enhance the discriminative power of feature representation, while neglecting important structural information in videos, including the interactions and correlations between different object entities. In this paper, we focus on reasoning about the spatial–temporal relations between persons and contextual objects to interpret the observed part of a video and predict action categories. With this in mind, we propose a novel spatial–temporal relation reasoning approach that extracts the spatial relations between persons and objects in still frames and explores how these spatial relations change over time. Specifically, for spatial relation reasoning, we propose an improved gated graph neural network to perform spatial relation reasoning between the visual objects in video frames. For temporal relation reasoning, we propose a long short-term graph network to model both the short-term and long-term varying dynamics of the spatial relations with multi-scale receptive fields. In this way, our approach can accurately recognize the video content in terms of fine-grained object relations in both spatial and temporal domains to make prediction decisions. Moreover, in order to learn the latent correlations between spatial–temporal object relations and action categories in videos, a visual semantic relation loss is proposed to model the triple constraints between objects in the semantic domain via VTransE. Extensive experiments on five public video datasets (i.e., 20BN-something-something, CAD120, UCF101, BIT-Interaction and HMDB51) demonstrate the effectiveness of the proposed spatial–temporal relation reasoning for action prediction.
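As a rough illustration of the spatial relation reasoning step, the sketch below shows one round of gated, GGNN-style message passing over the person and object nodes detected in a single frame. It is not the authors' implementation: the class name, feature dimensions, number of propagation steps, and the learned soft adjacency are assumptions made for illustration, with node features presumed to come from a pretrained object detector.

```python
# Minimal sketch (assumed design, not the paper's code): gated message passing
# over person/object nodes of one frame, yielding relation-aware node features.
import torch
import torch.nn as nn


class GatedGraphSpatialReasoning(nn.Module):
    def __init__(self, feat_dim=512, num_steps=3):
        super().__init__()
        self.num_steps = num_steps
        # Edge score between every pair of nodes, predicted from their features.
        self.edge_fc = nn.Linear(2 * feat_dim, 1)
        # Message transform and GRU-style gated node update, as in a GGNN.
        self.msg_fc = nn.Linear(feat_dim, feat_dim)
        self.gru = nn.GRUCell(feat_dim, feat_dim)

    def forward(self, nodes):                      # nodes: (N, feat_dim)
        n = nodes.size(0)
        for _ in range(self.num_steps):
            # Soft adjacency (pairwise relation weights) from concatenated features.
            pairs = torch.cat(
                [nodes.unsqueeze(1).expand(n, n, -1),
                 nodes.unsqueeze(0).expand(n, n, -1)], dim=-1)
            adj = torch.softmax(self.edge_fc(pairs).squeeze(-1), dim=-1)  # (N, N)
            # Aggregate messages from neighbours and gate them into each node.
            messages = adj @ self.msg_fc(nodes)    # (N, feat_dim)
            nodes = self.gru(messages, nodes)
        return nodes


# Usage: 5 detected entities (e.g., 1 person + 4 objects) with 512-d features.
frame_nodes = torch.randn(5, 512)
refined = GatedGraphSpatialReasoning()(frame_nodes)
print(refined.shape)                               # torch.Size([5, 512])
```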


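The visual semantic relation loss builds on VTransE's translation view of a (subject, predicate, object) triple: in a learned embedding space, the subject embedding plus a predicate vector should land near the object embedding. The sketch below illustrates that constraint with a margin ranking loss over sampled negative predicates; the module name, dimensions, and the particular loss form are assumptions for illustration rather than the paper's exact formulation.

```python
# Minimal sketch (illustrative assumptions) of a VTransE-style constraint:
# f(subject) + t(predicate) ≈ f(object) in a shared relation space.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualSemanticRelationLoss(nn.Module):
    def __init__(self, feat_dim=512, embed_dim=256, num_predicates=50, margin=1.0):
        super().__init__()
        self.subj_proj = nn.Linear(feat_dim, embed_dim)    # project subject features
        self.obj_proj = nn.Linear(feat_dim, embed_dim)     # project object features
        self.predicate = nn.Embedding(num_predicates, embed_dim)  # translation vectors
        self.margin = margin

    def forward(self, subj_feat, obj_feat, pred_idx, neg_pred_idx):
        s = self.subj_proj(subj_feat)                      # (B, embed_dim)
        o = self.obj_proj(obj_feat)                        # (B, embed_dim)
        pos = self.predicate(pred_idx)                     # true predicate
        neg = self.predicate(neg_pred_idx)                 # sampled negative predicate
        # Translation residual should be small for the true triple, large otherwise.
        d_pos = F.pairwise_distance(s + pos, o)
        d_neg = F.pairwise_distance(s + neg, o)
        return F.relu(self.margin + d_pos - d_neg).mean()  # margin ranking loss


# Usage: a batch of 8 (person, predicate, object) triples.
loss_fn = VisualSemanticRelationLoss()
loss = loss_fn(torch.randn(8, 512), torch.randn(8, 512),
               torch.randint(0, 50, (8,)), torch.randint(0, 50, (8,)))
```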

Updated: 2021-02-12