A Real-Time Action Representation With Temporal Encoding and Deep Compression
IEEE Transactions on Circuits and Systems for Video Technology (IF 8.3) Pub Date: 2020-03-31, DOI: 10.1109/tcsvt.2020.2984569
Kun Liu, Wu Liu, Huadong Ma, Mingkui Tan, Chuang Gan

Deep neural networks have achieved remarkable success in video-based action recognition. However, most existing approaches cannot be deployed in practice due to their high computational cost. To address this challenge, we propose a new real-time convolutional architecture, called the Temporal Convolutional 3D Network (T-C3D), for action representation. T-C3D learns video action representations in a hierarchical multi-granularity manner while achieving a high processing speed. Specifically, we propose a residual 3D Convolutional Neural Network (CNN) to capture complementary information: the appearance of individual frames and the motion between consecutive frames. Based on this CNN, we develop a new temporal encoding method to explore the temporal dynamics of the whole video. Furthermore, we integrate deep compression techniques with T-C3D to further accelerate model deployment by reducing the model size. By these means, heavy computation is avoided at inference time, which enables the method to process videos at beyond-real-time speed while maintaining strong performance. We validate our approach by studying its action representation performance on four benchmarks covering three different tasks. On the UCF101 action recognition benchmark, our method improves on the state-of-the-art real-time methods by 5.4% in accuracy while running twice as fast at inference, with a model that requires less than 5 MB of storage. The source code and the pre-trained models are publicly available at https://github.com/tc3d.
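The pipeline described above (segment-wise clip sampling, a residual 3D CNN applied per clip, temporal encoding across clips, then model compression) can be illustrated with a short PyTorch sketch. This is a hedged illustration, not the authors' released implementation: the backbone stand-in (torchvision's r3d_18), the average-fusion aggregation, and names such as TC3DSketch and num_segments are assumptions for exposition only.

# Minimal sketch of a T-C3D-style inference pipeline (illustrative, not the
# paper's code): one short clip per video segment is encoded by a residual
# 3D CNN, and clip-level scores are fused into a video-level prediction.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18  # residual 3D CNN stand-in

class TC3DSketch(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, num_classes: int = 101, num_segments: int = 3):
        super().__init__()
        self.num_segments = num_segments
        self.backbone = r3d_18(weights=None)  # ResNet3D-18 backbone stand-in
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, num_segments, C, T, H, W) -- one clip per segment
        b, s, c, t, h, w = clips.shape
        logits = self.backbone(clips.reshape(b * s, c, t, h, w))
        logits = logits.view(b, s, -1)
        # Temporal encoding step: fuse clip-level scores into a video-level
        # score. Average fusion is one simple aggregation choice; the paper
        # proposes a learned encoding over the whole video.
        return logits.mean(dim=1)

model = TC3DSketch()
video = torch.randn(2, 3, 3, 16, 112, 112)  # 2 videos, 3 clips of 16 frames
scores = model(video)  # (2, 101) video-level class scores

The deep-compression step can likewise be illustrated, assuming post-training dynamic quantization as a simple stand-in; the paper's actual compression pipeline is more involved.

# Hedged illustration of shrinking the stored model: dynamic quantization
# of the linear layers to 8-bit integers (one common compression choice).
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)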

Updated: 2020-03-31