Video action detection by learning graph-based spatio-temporal interactions

https://doi.org/10.1016/j.cviu.2021.103187
Open access under a Creative Commons license

Highlights

  • Video action detection is addressed using spatio-temporal graphs.

  • A single graph can handle spatial and temporal relationships.

  • Consistent improvements over strong backbones and state-of-the-art results are reported.

  • Improvements are obtained without backbone fine-tuning, by learning only the interaction module.

Abstract

Action detection is a complex task that aims to detect and classify human actions in video clips. It has typically been addressed by processing fine-grained features extracted from a video classification backbone. Recently, thanks to the robustness of object and people detectors, increasing attention has been devoted to modeling the relationships between detected entities. Following this line, we propose a graph-based framework to learn high-level interactions between people and objects, in both space and time. In our formulation, spatio-temporal relationships are learned through self-attention on a multi-layer graph structure that can connect entities from consecutive clips, thus capturing long-range spatial and temporal dependencies. The proposed module is backbone-independent by design and does not require end-to-end training. Extensive experiments are conducted on the AVA dataset, where our model achieves state-of-the-art results and consistent improvements over baselines built with different backbones. Code is publicly available at https://github.com/aimagelab/STAGE_action_detection.
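To make the core idea concrete, the sketch below shows one way masked self-attention over a graph of entity features from consecutive clips could be implemented in PyTorch. This is an illustrative assumption, not the authors' released implementation: the class and function names, the feature dimension, and the adjacency rule (entities connected within the same clip and across adjacent clips) are all hypothetical.

```python
# Minimal sketch (hypothetical, not the authors' code) of one graph
# self-attention layer over pre-extracted entity features from a frozen
# backbone. Stacking such layers widens the temporal horizon, since each
# layer only lets a node attend to entities in its own or adjacent clips.
import torch
from torch import nn


class GraphSelfAttention(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) node features; adj: (N, N) boolean adjacency mask.
        attn = (self.q(x) @ self.k(x).t()) * self.scale
        attn = attn.masked_fill(~adj, float("-inf"))  # block non-edges
        attn = torch.softmax(attn, dim=-1)
        return x + attn @ self.v(x)  # residual keeps backbone features


def temporal_adjacency(clip_ids: torch.Tensor) -> torch.Tensor:
    # Connect nodes whose clip indices are identical or consecutive.
    diff = clip_ids[:, None] - clip_ids[None, :]
    return diff.abs() <= 1


# Toy usage: 5 detected entities spread over 3 consecutive clips.
feats = torch.randn(5, 1024)
clip_ids = torch.tensor([0, 0, 1, 2, 2])
layer = GraphSelfAttention(dim=1024)
out = layer(feats, temporal_adjacency(clip_ids))
print(out.shape)  # torch.Size([5, 1024])
```

Because the layer consumes fixed features and adds a residual connection, it can be trained on its own, which is consistent with the paper's claim that improvements are obtained without fine-tuning the backbone.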

MSC

68T10
68T45

Keywords

Video understanding
Action detection
Graph learning
