Improved SSD using deep multi-scale attention spatial–temporal features for action recognition
Multimedia Systems (IF 3.9), Pub Date: 2021-07-14, DOI: 10.1007/s00530-021-00831-4
Shuren Zhou 1, Jia Qiu 1, Arun Solanki 2

The main difference between video-based and image-based action recognition is that video adds a temporal dimension. Most deep-learning methods for action recognition take one of two approaches: (1) modeling temporal features with 3D convolution, or (2) introducing an auxiliary temporal feature such as optical flow. However, 3D convolutional networks usually consume substantial computational resources, while optical flow extraction requires a tedious extra processing step and additional storage, and typically captures only short-range temporal features. To build better temporal features, this paper proposes a multi-scale attention spatial–temporal feature network based on SSD. The video is sparsely sampled by dividing the entire sequence into segments over its full length, and a self-attention mechanism captures the relation between each frame and the sequence of frames sampled across the whole video, so that the network attends to the representative frames of the sequence. In addition, an attention mechanism assigns different weights to the inter-frame relations representing different time scales, in order to reason about the contextual relations of actions along the time dimension. The proposed method achieves competitive performance on two commonly used datasets, UCF101 and HMDB51.
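The two ideas named in the abstract, segment-wise sparse sampling over the whole video and self-attention across the sampled frames, can be illustrated with a minimal sketch. This is not the authors' network: the SSD backbone and the multi-scale weighting are not modeled. The function names `sparse_sample_indices` and `self_attention` are hypothetical. The sketch also assumes one randomly chosen frame per segment, and plain scaled dot-product attention with identity query/key/value projections, neither of which the abstract specifies.

```python
import math
import random


def sparse_sample_indices(num_frames, num_segments, rng=None):
    """Split the video into equal-length segments and pick one frame per segment.

    Assumption: one uniformly random frame per segment; the abstract only says
    the video is sampled sparsely, segment by segment, over its full range.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random(0)
    indices = []
    for k in range(num_segments):
        start = (k * num_frames) // num_segments
        end = ((k + 1) * num_frames) // num_segments  # exclusive bound
        indices.append(rng.randrange(start, end))
    return indices


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention over per-frame feature vectors.

    Assumption: queries, keys and values are the features themselves
    (identity projections); a trained model would learn these projections.
    Returns (outputs, weights), where weights[i][j] is how strongly
    frame i attends to frame j.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in features:
        scores = [sum(a * b for a, b in zip(query, key)) / scale
                  for key in features]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(row[j] * features[j][c] for j in range(len(features)))
                        for c in range(dim)])
    return outputs, weights


if __name__ == "__main__":
    frame_ids = sparse_sample_indices(num_frames=300, num_segments=8)
    feature_rng = random.Random(1)
    # Stand-in features; in practice these would come from a CNN backbone.
    frame_features = [[feature_rng.gauss(0.0, 1.0) for _ in range(16)]
                      for _ in frame_ids]
    attended, attention = self_attention(frame_features)
    print("sampled frames:", frame_ids)
    print("attention of frame 0:", [round(w, 3) for w in attention[0]])
```

Each row of the attention matrix sums to 1, so a row can be read as how much one sampled frame draws on every other sampled frame. Frames that receive large weights from many rows play the role of the "representative frames" the abstract refers to.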


