MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition



arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2020-01-19. DOI: arxiv-2001.06769
Kaiyu Shan, Yongtao Wang, Zhuoying Wang, Tingting Liang, Zhi Tang, Ying Chen, and Yangyan Li

To efficiently extract spatiotemporal features from video for action recognition, most state-of-the-art methods integrate 1D temporal convolution into a conventional 2D CNN backbone. However, they all use 1D temporal convolution with a fixed kernel size (i.e., 3) in the network building block, and thus have suboptimal temporal modeling capability for handling both long-term and short-term actions. To address this problem, we first investigate the impact of different kernel sizes on the 1D temporal convolutional filters. Then, we propose a simple yet efficient operation called Mixed Temporal Convolution (MixTConv), which consists of multiple depthwise 1D convolutional filters with different kernel sizes. By plugging MixTConv into the conventional 2D CNN backbone ResNet-50, we further propose an efficient and effective network architecture named MSTNet for action recognition, and achieve state-of-the-art results on multiple benchmarks.
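The operation described in the abstract, several depthwise 1D temporal filters with different kernel sizes, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the equal channel split across groups, the kernel sizes (1, 3, 5, 7), the zero "same" padding, and the random (untrained) weights are all assumptions made for illustration.

```python
import numpy as np


def depthwise_temporal_conv(x, weights):
    """Depthwise 1D convolution along the time axis with zero 'same' padding.

    x:       array of shape (T, C, H, W), per-frame feature maps of one clip
    weights: array of shape (C, k), one length-k filter per channel (k odd)

    Like deep learning frameworks, this computes a cross-correlation
    (the filter is not flipped).
    """
    T = x.shape[0]
    k = weights.shape[1]
    pad = k // 2
    # Zero-pad only the time axis so the output keeps T frames.
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    for j in range(k):
        # Tap j of every channel's filter multiplies the frames shifted by j.
        out += weights[:, j][None, :, None, None] * xp[j:j + T]
    return out


def mixed_temporal_conv(x, weights_per_group):
    """Split channels into groups; each group gets its own kernel size.

    weights_per_group: list of arrays, each of shape (C_g, k_g).
    The group channel counts C_g must add up to the channel count of x.
    """
    C = x.shape[1]
    assert sum(w.shape[0] for w in weights_per_group) == C
    outputs, start = [], 0
    for w in weights_per_group:
        stop = start + w.shape[0]
        outputs.append(depthwise_temporal_conv(x[:, start:stop], w))
        start = stop
    # Concatenate the groups back along the channel axis.
    return np.concatenate(outputs, axis=1)


# Usage: 8 frames, 16 channels, 4x4 spatial maps (all sizes are illustrative).
rng = np.random.default_rng(0)
T, C, H, W = 8, 16, 4, 4
x = rng.standard_normal((T, C, H, W))
kernel_sizes = (1, 3, 5, 7)  # assumed set of kernel sizes
per_group = C // len(kernel_sizes)
weights = [rng.standard_normal((per_group, k)) for k in kernel_sizes]
y = mixed_temporal_conv(x, weights)
print(y.shape)  # (8, 16, 4, 4)
```

Because every filter is depthwise and acts only along the time axis, the output keeps the input shape, which is what would let such an operation be inserted into a 2D CNN block without changing the surrounding layers.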


Updated: 2020-01-28
