Modeling Multi-Label Action Dependencies for Temporal Action Localization,arXiv - CS - Computer Vision and Pattern Recognition


Modeling Multi-Label Action Dependencies for Temporal Action Localization
arXiv - CS - Computer Vision and Pattern Recognition, Pub Date: 2021-03-04, DOI: arxiv-2103.03027
Praveen Tirupattur, Kevin Duarte, Yogesh Rawat, Mubarak Shah

Real-world videos contain many complex actions with inherent relationships between action classes. In this work, we propose an attention-based architecture that models these action relationships for the task of temporal action localization in untrimmed videos. As opposed to previous works that leverage video-level co-occurrence of actions, we distinguish the relationships between actions that occur at the same time-step and actions that occur at different time-steps (i.e., those which precede or follow each other). We define these distinct relationships as action dependencies. We propose to improve action localization performance by modeling these action dependencies in a novel attention-based Multi-Label Action Dependency (MLAD) layer. The MLAD layer consists of two branches: a Co-occurrence Dependency Branch and a Temporal Dependency Branch, which model co-occurrence action dependencies and temporal action dependencies, respectively. We observe that existing metrics used for multi-label classification do not explicitly measure how well action dependencies are modeled; therefore, we propose novel metrics that consider both co-occurrence and temporal dependencies between action classes. Through empirical evaluation and extensive analysis, we show improved performance over state-of-the-art methods on multi-label action localization benchmarks (MultiTHUMOS and Charades) in terms of f-mAP and our proposed metric.
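To make the two-branch idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes class-level features laid out as (T, C, D) for T time-steps, C action classes, and D feature dimensions, uses plain scaled dot-product self-attention without learned projections, and mixes the two branches with a fixed weight `alpha`. The names `self_attention`, `mlad_like_layer`, and `alpha`, as well as the feature layout, are hypothetical choices made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.

    x has shape (..., N, D). Queries, keys, and values are all x itself,
    a simplification: a real layer would use learned projections.
    Returns the attended features (..., N, D) and the weights (..., N, N).
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ x, weights


def mlad_like_layer(feats, alpha=0.5):
    """Illustrative two-branch layer (assumed design, not the paper's code).

    feats: array of shape (T, C, D).
    """
    # Co-occurrence branch: at each time-step, classes attend to each other.
    cooc_out, cooc_w = self_attention(feats)          # (T, C, D), (T, C, C)

    # Temporal branch: for each class, time-steps attend to each other.
    per_class = np.swapaxes(feats, 0, 1)              # (C, T, D)
    temp_out, temp_w = self_attention(per_class)      # (C, T, D), (C, T, T)
    temp_out = np.swapaxes(temp_out, 0, 1)            # back to (T, C, D)

    # Fixed mixing weight; an assumption made for this sketch.
    out = alpha * cooc_out + (1.0 - alpha) * temp_out
    return out, cooc_w, temp_w


rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 4, 8))                # T=6, C=4, D=8
out, cooc_w, temp_w = mlad_like_layer(feats)
```

In this sketch, `cooc_w[t]` is a C-by-C matrix relating classes at time-step `t`, while `temp_w[c]` is a T-by-T matrix relating time-steps for class `c`, which mirrors the abstract's distinction between same-time-step and cross-time-step dependencies.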


Updated: 2021-03-05
