Spatial–temporal pooling for action recognition in videos
Neurocomputing (IF 5.5) Pub Date: 2021-04-24, DOI: 10.1016/j.neucom.2021.04.071
Jiaming Wang, Zhenfeng Shao, Xiao Huang, Tao Lu, Ruiqian Zhang, Xianwei Lv

Over the past decade, deep convolutional neural networks have demonstrated great effectiveness in action recognition with both RGB and optical flow inputs. However, existing studies generally treat all frames and pixels equally, potentially leading to poor model robustness. In this paper, we propose a novel parameter-free spatial–temporal pooling block (referred to as STP) for action recognition in videos to address this challenge. STP learns spatial and temporal weights, which are then used to guide information compression. Unlike other temporal pooling layers, STP is more efficient because it discards the non-informative frames within a clip. In addition, STP applies a novel loss function that forces the model to learn from sparse and discriminative frames. Moreover, we introduce a dataset for ferry action classification, named Ferryboat-4, which includes four categories: Inshore, Offshore, Traffic, and Negative. This dataset can be used to identify ferries with abnormal behaviors, providing essential information to support the supervision, management, and monitoring of ships. All videos were acquired from real-world cameras. We perform extensive experiments on publicly available datasets as well as Ferryboat-4 and find that the proposed method outperforms several state-of-the-art methods in action classification. Source code and datasets are available at https://github.com/jiaming-wang/STP.
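The abstract does not spell out how the parameter-free weights are computed, so the following is a minimal PyTorch sketch of one plausible reading: spatial weights derived from per-pixel activation energy, temporal weights from frame-level feature magnitude, and the least informative frames discarded by hard top-k selection. The function name `spatial_temporal_pool`, the `keep_ratio` parameter, and both scoring heuristics are illustrative assumptions, not the authors' actual STP block; see the linked repository for the real implementation.

```python
import torch
import torch.nn.functional as F

def spatial_temporal_pool(x, keep_ratio=0.5):
    """Hypothetical parameter-free spatial-temporal pooling sketch.

    x: clip features of shape (N, T, C, H, W).
    Returns a clip-level descriptor of shape (N, C).
    """
    N, T, C, H, W = x.shape

    # Spatial weights: per-pixel activation energy, normalized over H*W.
    energy = x.pow(2).mean(dim=2)                              # (N, T, H, W)
    spatial_w = F.softmax(energy.flatten(2), dim=-1).view(N, T, 1, H, W)
    pooled = (x * spatial_w).sum(dim=(3, 4))                   # (N, T, C)

    # Temporal scores: frame informativeness as feature magnitude.
    frame_score = pooled.norm(dim=2)                           # (N, T)

    # Discard non-informative frames: keep only the top-k per clip.
    k = max(1, int(T * keep_ratio))
    topk_score, topk_idx = frame_score.topk(k, dim=1)          # (N, k)
    temporal_w = F.softmax(topk_score, dim=1)                  # (N, k)
    kept = pooled.gather(1, topk_idx.unsqueeze(-1).expand(-1, -1, C))

    # Weighted temporal aggregation of the surviving frames.
    return (kept * temporal_w.unsqueeze(-1)).sum(dim=1)        # (N, C)

# Example: a batch of 2 clips, 8 frames, 64 channels, 7x7 feature maps.
feats = torch.randn(2, 8, 64, 7, 7)
clip_repr = spatial_temporal_pool(feats, keep_ratio=0.5)       # (2, 64)
```

The paper's sparsity-inducing loss is not reproduced here; one hedged approximation would be an L1 penalty on the temporal weights, encouraging the model to concentrate on a few discriminative frames as the abstract describes.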




Updated: 2021-05-09