Video action detection by learning graph-based spatio-temporal interactions
Computer Vision and Image Understanding (IF 4.3), Pub Date: 2021-02-27, DOI: 10.1016/j.cviu.2021.103187
Matteo Tomei, Lorenzo Baraldi, Simone Calderara, Simone Bronzin, Rita Cucchiara

Action Detection is a complex task that aims to detect and classify human actions in video clips. Typically, it has been addressed by processing fine-grained features extracted from a video classification backbone. Recently, thanks to the robustness of object and people detectors, greater focus has been placed on relationship modeling. Following this line, we propose a graph-based framework to learn high-level interactions between people and objects, in both space and time. In our formulation, spatio-temporal relationships are learned through self-attention on a multi-layer graph structure which can connect entities from consecutive clips, thus considering long-range spatial and temporal dependencies. The proposed module is backbone independent by design and does not require end-to-end training. Extensive experiments are conducted on the AVA dataset, where our model demonstrates state-of-the-art results and consistent improvements over baselines built with different backbones. Code is publicly available at https://github.com/aimagelab/STAGE_action_detection.
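
To make the described mechanism concrete, the following is a minimal, hypothetical sketch of self-attention over a graph of detected entities (people and objects) spanning two consecutive clips. It is not the authors' STAGE implementation (available at the repository above); the feature dimensions, the adjacency rule connecting entities within and across adjacent clips, and the use of masked multi-head attention are all illustrative assumptions.

```python
# Hypothetical sketch: masked self-attention over a spatio-temporal entity graph.
# Not the STAGE code; dimensions and the adjacency rule are assumptions.
import torch
import torch.nn as nn


class GraphSelfAttention(nn.Module):
    """One layer of self-attention where each entity attends only to entities
    it is connected to in the spatio-temporal graph."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (1, N, dim) entity features
        # adj: (N, N) boolean adjacency; True means the pair is connected.
        mask = ~adj  # True = attention not allowed between this pair
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return self.norm(x + out)  # residual connection + normalization


# Toy example: 3 entities in clip t, 2 entities in clip t+1, 256-d features
# (e.g. pooled from a detector / video backbone; values are random here).
feat_t = torch.randn(3, 256)
feat_t1 = torch.randn(2, 256)
x = torch.cat([feat_t, feat_t1]).unsqueeze(0)            # (1, 5, 256)
clip_id = torch.tensor([0, 0, 0, 1, 1])

# Connect entities within the same clip and across the two consecutive clips
# (fully connected between adjacent clips; the paper's criterion may differ).
adj = (clip_id[:, None] - clip_id[None, :]).abs() <= 1   # (5, 5) bool

layer = GraphSelfAttention(dim=256)
updated = layer(x, adj)                                   # refined entity features
print(updated.shape)  # torch.Size([1, 5, 256])
```

Because the layer only consumes pre-extracted entity features and an adjacency matrix, a module of this kind can sit on top of any detector/backbone, which is consistent with the backbone-independent design the abstract describes.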




Updated: 2021-03-07