Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8) Pub Date: 2022-05-10, DOI: 10.1109/tpami.2022.3173658
Mengmeng Wang, Jiazheng Xing, Jing Su, Jun Chen, Yong Liu
Recent methods for action recognition typically apply 3D Convolutional Neural Networks (CNNs) to extract spatiotemporal features and introduce optical flow to capture motion features. Although they achieve state-of-the-art performance, they are expensive in both time and space. In this paper, we propose to represent both kinds of features in a unified 2D CNN, without any 3D convolution or optical-flow computation. In particular, we first design a channel-wise spatiotemporal module to encode spatiotemporal features and a channel-wise motion module to encode feature-level motion features efficiently. Besides, we provide a distinctive frequency-domain interpretation of the two modules, viewing them as advanced, learnable versions of frequency components. Second, we combine these two modules and an identity-mapping path into one unified block that can readily replace the original residual block in the ResNet architecture, forming a simple yet effective network, dubbed the STM network, with very limited extra computational cost and parameters. Third, we propose a novel Twins Training framework for action recognition, incorporating a correlation loss to optimize inter-class and intra-class correlation and a siamese structure to fully exploit the training data. We extensively validate the proposed STM on both temporal-related datasets (i.e., Something-Something v1 & v2) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51). It achieves favorable results against state-of-the-art methods on all of these datasets.
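The abstract's core idea, encoding feature-level motion without optical flow and fusing it with spatiotemporal features inside a residual-style block, can be illustrated with a minimal sketch. This is an assumption-laden simplification, not the paper's implementation: the actual modules use learned channel-wise convolutions, whereas here motion is approximated by raw temporal differences of adjacent frame features, and the two module functions are left as placeholders.

```python
import numpy as np

def motion_features(feats: np.ndarray) -> np.ndarray:
    """Approximate feature-level motion as the difference between adjacent
    frames. Input shape: (T, C, H, W). The last timestep is zero-padded so
    the temporal length T is preserved."""
    diff = feats[1:] - feats[:-1]              # (T-1, C, H, W)
    pad = np.zeros_like(feats[:1])             # one zero frame to keep T steps
    return np.concatenate([diff, pad], axis=0)

def stm_block(feats, spatiotemporal_fn, motion_fn):
    """Sketch of a unified block: identity mapping plus the outputs of a
    spatiotemporal branch and a motion branch, mirroring the additive
    structure of a ResNet residual block."""
    return feats + spatiotemporal_fn(feats) + motion_fn(feats)

# Example: a static clip (identical frames) yields zero motion features.
clip = np.ones((8, 4, 6, 6))                   # T=8 frames, C=4, 6x6 spatial
assert np.allclose(motion_features(clip), 0.0)
```

Because the block reduces to an identity mapping when both branches output zero, it can replace a residual block without disturbing the rest of the architecture, which is what makes the drop-in claim in the abstract plausible.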

Updated: 2024-08-28