Learning Motion Representation for Real-Time Spatio-Temporal Action Localization
Pattern Recognition (IF 7.5) Pub Date: 2020-07-01, DOI: 10.1016/j.patcog.2020.107312
Dejun Zhang, Linchao He, Zhigang Tu, Shifu Zhang, Fei Han, Boxiong Yang

Abstract Current deep-learning-based spatio-temporal action localization methods that use motion information (predominantly optical flow) achieve state-of-the-art performance. However, because the optical flow is pre-computed, these methods face two problems: computational efficiency is low, and the whole network is not end-to-end trainable. We propose a novel spatio-temporal action localization approach with an integrated optical flow sub-network to address these two issues. Specifically, our flow subnet estimates optical flow efficiently and accurately by using multiple consecutive RGB frames rather than two adjacent frames in a deep network; at the same time, action localization is performed in the same network, interacting with the flow computation end-to-end. To further increase speed, we exploit a neural-network-based feature fusion method in a pyramidal hierarchical manner. It fuses spatial and temporal features at different granularities via a combination function (i.e. concatenation) and point-wise convolution to obtain multiscale spatio-temporal action features. Experimental results on three publicly available datasets (UCF101-24, JHMDB, and AVA) show that, with both RGB appearance and optical flow cues, the proposed method achieves state-of-the-art performance in both efficiency and accuracy. Notably, it yields a significant improvement in efficiency: compared to the currently most efficient method, it runs 1.9 times faster and is 1.3% more accurate in video-mAP on UCF101-24. Our proposed method reaches real-time computation for the first time (up to 38 FPS).
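To make the fusion step concrete, below is a minimal sketch of the kind of concatenation plus point-wise convolution described in the abstract, written in PyTorch. The module name, channel sizes, and activation are illustrative assumptions and are not taken from the paper; the paper applies such fusion at multiple pyramid levels, while this sketch shows a single level.

```python
import torch
import torch.nn as nn


class SpatioTemporalFusion(nn.Module):
    """Illustrative fusion block (not the authors' exact architecture):
    concatenate appearance (RGB) and motion (flow) feature maps at one
    pyramid level, then mix channels with a point-wise (1x1) convolution."""

    def __init__(self, spatial_channels: int, temporal_channels: int, out_channels: int):
        super().__init__()
        # 1x1 convolution acts as the point-wise channel-mixing step.
        self.pointwise = nn.Conv2d(spatial_channels + temporal_channels,
                                   out_channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, spatial_feat: torch.Tensor, temporal_feat: torch.Tensor) -> torch.Tensor:
        # Combination function: channel-wise concatenation of the two streams.
        fused = torch.cat([spatial_feat, temporal_feat], dim=1)
        # Point-wise convolution produces the fused spatio-temporal feature map.
        return self.relu(self.pointwise(fused))


if __name__ == "__main__":
    # Hypothetical single pyramid level: 28x28 feature maps from the RGB
    # and flow branches with 256 channels each.
    rgb_feat = torch.randn(2, 256, 28, 28)
    flow_feat = torch.randn(2, 256, 28, 28)
    fusion = SpatioTemporalFusion(256, 256, 256)
    print(fusion(rgb_feat, flow_feat).shape)  # torch.Size([2, 256, 28, 28])
```

In the pyramidal setting described by the authors, a block of this form would be repeated at several feature-map resolutions so that fusion happens at different granularities.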

Last updated: 2020-07-01