Activity Graph Transformer for Temporal Action Localization
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2021-01-21, DOI: arxiv-2101.08540
Megha Nawhal, Greg Mori

We introduce Activity Graph Transformer, an end-to-end learnable model for temporal action localization that receives a video as input and directly predicts the set of action instances that appear in it. Detecting and localizing action instances in untrimmed videos requires reasoning over multiple action instances within a video. The dominant paradigms in the literature process videos sequentially in time, either proposing action regions or directly producing frame-level detections. However, sequential processing is problematic when action instances have non-sequential dependencies and/or non-linear temporal ordering, such as overlapping instances or instances that recur over the course of the video. In this work, we capture this non-linear temporal structure by reasoning over videos as non-sequential entities in the form of graphs. We evaluate our model on the challenging THUMOS14, Charades, and EPIC-Kitchens-100 datasets. Our results show that the proposed model outperforms the state-of-the-art by a considerable margin.
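The abstract outlines the key design points: video clips are treated as nodes of a graph rather than a strict sequence, attention-based reasoning connects arbitrary pairs of clips, and the model directly emits a set of action instances end to end. Below is a minimal PyTorch sketch of that shape, assuming a DETR-style set-prediction layout with learned queries; every module name, dimension, and prediction head here is an illustrative assumption, not the authors' implementation.

# Minimal sketch (not the authors' code) of the idea in the abstract:
# encode clips as graph nodes, reason over them with attention rather
# than strictly sequential processing, and decode a fixed-size set of
# action instances. All sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class ActivityGraphTransformerSketch(nn.Module):
    def __init__(self, feat_dim=2048, d_model=256, num_classes=20,
                 num_queries=100, num_heads=8, num_layers=4):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        # Self-attention acts as reasoning over a fully connected graph of
        # clip nodes: any clip can attend to any other, so dependencies
        # need not follow temporal order.
        enc_layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Learned queries decode a fixed-size set of candidate instances.
        self.queries = nn.Embedding(num_queries, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1: "no action"
        self.segment_head = nn.Linear(d_model, 2)  # normalized (center, width)

    def forward(self, clip_feats):
        # clip_feats: (batch, num_clips, feat_dim) pre-extracted features
        nodes = self.encoder(self.input_proj(clip_feats))
        q = self.queries.weight.unsqueeze(0).expand(clip_feats.size(0), -1, -1)
        decoded = self.decoder(q, nodes)
        return {
            "logits": self.class_head(decoded),                # per-query class scores
            "segments": self.segment_head(decoded).sigmoid(),  # (center, width) in [0, 1]
        }

# Usage: 64 clips of 2048-d features -> 100 candidate action instances.
model = ActivityGraphTransformerSketch()
out = model(torch.randn(2, 64, 2048))
print(out["logits"].shape, out["segments"].shape)  # (2, 100, 21) (2, 100, 2)

In a set-prediction design like this, each query is matched to at most one ground-truth instance during training, which is what lets the model predict overlapping or recurring actions directly instead of relying on sequential proposals.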

Updated: 2021-01-22