Depthwise Spatio-Temporal STFT Convolutional Neural Networks for Human Action Recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 23.6) Pub Date: 2021-04-29, DOI: 10.1109/tpami.2021.3076522
Sudhakar Kumawat, Manisha Verma, Yuta Nakashima, Shanmuganathan Raman

Conventional 3D convolutional neural networks (CNNs) are computationally expensive, memory intensive, and prone to overfitting; most importantly, their feature learning capabilities need improvement. To address these issues, we propose spatio-temporal short-term Fourier transform (STFT) blocks, a new class of convolutional blocks that can serve as an alternative to the 3D convolutional layer and its variants in 3D CNNs. An STFT block consists of non-trainable convolution layers that capture spatially and/or temporally local Fourier information using an STFT kernel at multiple low-frequency points, followed by a set of trainable linear weights for learning channel correlations. The STFT blocks significantly reduce the space-time complexity of 3D CNNs. In general, they use 3.5 to 4.5 times fewer parameters and incur 1.5 to 1.8 times lower computational cost than the state-of-the-art methods. Furthermore, their feature learning capabilities are significantly better than those of the conventional 3D convolutional layer and its variants. Our extensive evaluation on seven action recognition datasets, including Something-something v1 and v2, Jester, Diving-48, Kinetics-400, UCF 101, and HMDB 51, demonstrates that STFT block-based 3D CNNs achieve performance on par with, or even better than, the state-of-the-art methods.
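
The two stages described in the abstract (a fixed, non-trainable local Fourier step applied per channel, then trainable linear weights that mix channels) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The 3x3x3 window, the three frequency points in `LOW_FREQS`, the choice to keep real and imaginary parts as separate features, and all function names and weight values are assumptions made for illustration only; the abstract does not specify them.

```python
import cmath
import itertools


def stft_kernel(freq, size=3):
    """Fixed complex-exponential kernel over a size^3 window (assumed form)."""
    half = size // 2
    offsets = range(-half, half + 1)
    kernel = {}
    for dt, dy, dx in itertools.product(offsets, repeat=3):
        phase = freq[0] * dt + freq[1] * dy + freq[2] * dx
        kernel[(dt, dy, dx)] = cmath.exp(-2j * cmath.pi * phase)
    return kernel


def depthwise_stft(volume, pos, freqs, size=3):
    """Non-trainable step: local Fourier coefficients around `pos`, per channel.

    `volume` is indexed as volume[channel][t][y][x]. Each channel is processed
    independently (depthwise). Real and imaginary parts are returned as
    separate features, which is an assumption of this sketch.
    """
    t, y, x = pos
    features = []
    for channel in volume:
        for freq in freqs:
            kernel = stft_kernel(freq, size)
            coeff = sum(
                w * channel[t + dt][y + dy][x + dx]
                for (dt, dy, dx), w in kernel.items()
            )
            features.extend([coeff.real, coeff.imag])
    return features


def pointwise_mix(features, weights):
    """Trainable step: one linear combination of all features per output channel."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


# Hypothetical low-frequency points: one per axis (time, height, width).
LOW_FREQS = [(1 / 3, 0, 0), (0, 1 / 3, 0), (0, 0, 1 / 3)]

# Toy input: 2 channels, 3x3x3 voxels, intensity varying only along time.
volume = [
    [[[float(t) for x in range(3)] for y in range(3)] for t in range(3)]
    for c in range(2)
]

features = depthwise_stft(volume, (1, 1, 1), LOW_FREQS)

# Placeholder weights; in a real network these would be learned.
weights = [[0.1] * len(features), [-0.1] * len(features)]
outputs = pointwise_mix(features, weights)
print(len(features), outputs)
```

In this sketch only `weights` would be trained; the kernels produced by `stft_kernel` stay fixed, which is the property the abstract credits for the reduced parameter count. Because the toy input varies only along time, only the temporal frequency point yields a non-zero coefficient.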

Updated: 2021-04-29