Pattern Recognition Letters

Volume 148, August 2021, Pages 22-28

T-VLAD: Temporal vector of locally aggregated descriptor for multiview human action recognition

https://doi.org/10.1016/j.patrec.2021.04.023

Highlights

  • We encode the long-term temporal structure of actions using single-stream C3D features from short segments of video.

  • We include time-order information when encoding the temporal sequence of the complete action video, hence the name temporal VLAD (T-VLAD).

  • T-VLAD timestamps the primitive motions of an action in short video segments, facilitating view-invariant action recognition.

  • State-of-the-art results are shown on the fixed-setup multiview datasets MuHAVi, IXMAS and MCAD.

  • The proposed encoding scheme T-VLAD performs equally well on a dynamic background dataset, UCF101.

Abstract

Robust view-invariant human action recognition (HAR) requires an effective representation of the temporal structure of actions in multi-view videos. This study explores a view-invariant action representation based on convolutional features. Action representation over long video segments is computationally expensive, whereas features from short video segments limit the temporal coverage locally. Previous methods are based on complex multi-stream deep convolutional feature maps extracted over short segments. To cope with this issue, a novel framework is proposed based on a temporal vector of locally aggregated descriptors (T-VLAD). T-VLAD encodes the long-term temporal structure of the video employing single-stream convolutional features over short segments. A standard VLAD vector size is a multiple of its feature codebook size (256 is normally recommended). VLAD is modified to incorporate time-order information of segments, so that the T-VLAD vector size is a multiple of its smaller time-order codebook size. Previous methods have not been extensively validated for view variation. Here, results are validated in a challenging setup, where one view is used for testing and the remaining views are used for training. State-of-the-art results have been obtained on three multi-view datasets with fixed cameras, IXMAS, MuHAVi and MCAD. The proposed encoding approach T-VLAD also works equally well on a dynamic background dataset, UCF101.
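
To make the size relationship concrete, here is a rough numeric sketch; the feature dimension and time-order codebook size below are assumed for illustration and are not values reported in the paper.

```python
# Rough dimensionality sketch with assumed values (not the paper's exact settings).
D = 4096        # assumed per-segment C3D feature dimension
K_f = 256       # typical VLAD feature-codebook size, as noted in the abstract
K_t = 32        # hypothetical, smaller time-order codebook size

vlad_size = D * K_f     # standard VLAD: one D-dim residual block per feature centre
tvlad_size = D * K_t    # T-VLAD: one D-dim pooled residual block per time-order centre

print(vlad_size, tvlad_size)   # 1048576 vs. 131072 for these assumed sizes
```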

Introduction

A robust view-invariant HAR framework requires an effective multi-view video representation. Influential multi-view methods [22], [37] suggest transferring view knowledge from one view to a high-level virtual view. However, these methods require view-specific information. Recent CNN-based (Convolutional Neural Network) HAR methods have shown significant progress in recognizing actions from single views. However, according to the review in [35], these methods are unable to overcome the challenge of view-point variation. There is thus a need to explore simpler CNN-feature methods that can handle view variation.

Different from hand-crafted methods for action representation [20], [30], CNN-based methods learn trainable action representations directly from video. CNNs were originally intended to extract 2D or spatial features, not the 3D spatio-temporal characteristics of video. Hence, a major interest is in how to incorporate temporal information in CNNs. An important and popular method that encodes temporal information takes stacked optical flow along with RGB frames as input in a two-stream model [23]. In [8], the two-stream model was extended by the addition of a novel spatial and temporal fusion layer using 3D Conv and 3D pooling. Tran et al. [26] factorized the 3D Conv into a separate 2D spatial Conv and a 1D temporal Conv, which are easier to optimize. In [15], the temporal excitation and aggregation (TEA) block introduced a multiple temporal aggregation (MTA) module that replaces the 1D temporal Conv with a group of sub-convolutions. Gammulle et al. [9] exploited the memory cells of LSTMs to learn the long-range temporal structure of video. However, TS-LSTM [17] suggests the use of pre-segmented data to fully exploit temporal information.
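
As a rough sketch of the factorization described for [26], the following PyTorch-style snippet replaces a single t x k x k 3D convolution with a 2D spatial convolution followed by a 1D temporal convolution; the layer widths and clip size are illustrative assumptions, not the architecture of [26].

```python
import torch
import torch.nn as nn

class Factorised3DConv(nn.Module):
    """Sketch: split a t x k x k 3D convolution into a spatial 1 x k x k
    convolution followed by a temporal t x 1 x 1 convolution."""
    def __init__(self, in_ch, mid_ch, out_ch, k=3, t=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0))

    def forward(self, x):               # x: (batch, channels, time, height, width)
        return self.temporal(torch.relu(self.spatial(x)))

clip = torch.randn(2, 3, 16, 112, 112)  # an assumed 16-frame clip at 112 x 112
out = Factorised3DConv(3, 45, 64)(clip)
print(out.shape)                        # torch.Size([2, 64, 16, 112, 112])
```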

In [13], [36], the authors investigated approaches to fuse and aggregate temporal frame-level information into video-level information. Varol et al. [29] apply Long-term Temporal Convolutions (LTC) on a large number of frames to capture the long-range temporal structure of video, whereas [32] introduces a non-local neural network for this purpose. To capture visual content over the entire video, the Temporal Relation Network (TRN) [38] exploits temporal relational reasoning, and the Temporal Segment Network (TSN) [31] computes features from randomly selected snippets of video segments. ActionS-ST-VLAD [27] aggregates spatio-temporal local CNN features over the entire video, ignoring time-order information. SeqVLAD [33], unlike NetVLAD [1] and ActionVLAD [10], encodes the temporal relationships of sequential frames, combining a trainable VLAD and a Recurrent Convolutional Network (RCN). [24] proposes a Temporal-Spatial Mapping (TSM) function that maps time-order information and CNN features of densely sampled frames into a 2D feature map, referred to as a VideoMap. This dense sampling increases the computational cost of the CNN, and the fixed sampling rate limits temporal coverage locally.

Most of the approaches [27], [29], [31], [33] proposed to model the long-range temporal structure of actions are limited to two-stream CNN features. These two-stream models require computationally expensive optical flow (or other motion vectors) as additional input, and the training of two deep networks. A single-stream 3D convolutional network (C3D) [25] replaces the 2D Conv kernel with a 3D Conv kernel, representing actions at short-segment level as generic, efficient and compact spatio-temporal features. These features are simply aggregated, thus ignoring the temporal relationship of sequential short segments. Bidirectional encoder representations from transformers (BERT) [12] adds a layer at the end of the 3D CNN that incorporates temporal information with an attention mechanism. The temporal pyramid network (TPN) [34] forms a feature hierarchy that captures visual characteristics at various temporal scales of an action. The temporal shift module (TSM) [16], inserted in a 2D CNN, shifts part of the channels along the temporal dimension to facilitate information exchange among neighbouring frames. X3D [7] expands a tiny spatial network along temporal duration, frame rate, spatial resolution, width, and depth to achieve an efficient video network with a target complexity. The temporal 3D convolutional network (T3D) [5] extends the 2D DenseNet architecture, adding a 3D temporal transition layer (TTL) which operates on different temporal depths, capturing short-, mid-, and long-range temporal information. Although the computation of 3D Conv burdens GPU memory, it does not involve computing the correlation between appearance and motion as in two-stream models.
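
To illustrate the channel-shift idea attributed to TSM [16], a minimal sketch is given below; the tensor layout and shift fraction are assumptions for illustration rather than the exact module of [16].

```python
import torch

def temporal_shift(x, fold_div=8):
    """Shift part of the channels along time: the first 1/fold_div of channels
    move one step forward, the next 1/fold_div move one step backward, and the
    remaining channels stay in place. x: (batch, time, channels, height, width)."""
    b, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # borrow from the next frame
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # borrow from the previous frame
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # untouched channels
    return out

clip = torch.randn(2, 8, 64, 56, 56)     # assumed 8-frame feature clip
print(temporal_shift(clip).shape)        # torch.Size([2, 8, 64, 56, 56])
```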

This work proposes a novel multi-view framework to capture the temporal sequence of a complete video, exploiting the time order of non-overlapping video segments. The spatio-temporal features of video segments are pooled employing the second-order statistics of VLAD, based on action-class membership and segment time-order membership. Forming a sequential VLAD from a codebook is a simpler approach compared to the end-to-end trainable networks SeqVLAD [33] and T3D [5], which capture temporal information through the additional top layers SGRU-RCN and TTL, respectively. These layers increase network training time, due to backpropagation through the RCN and the additional parameters for convolution at multiple temporal depths in the TTL.

Motivation: The proposed method is inspired by Spatio-Temporal VLAD (ST-VLAD) [6], which incorporates the spatial location of each frame's local features; T-VLAD instead includes the temporal location of each video segment's global features, hence extending frame-level information to video-level information. Moreover, human actions are a combination of several common primitive motions and a discriminating motion. VLAD recognizes an action based on its discriminating motion. The common primitive motions occur in a specific time order, and the discriminating motion may become occluded due to view variations. T-VLAD timestamps these primitive motions, facilitating view-invariant action recognition. This is illustrated in Fig. 5 and explained in detail in Section 3.6 with an example.

It is worth highlighting the following contributions: (1) We encode the long-term temporal structure of actions using single-stream C3D features from sparsely sampled short segments of video; computing features from short video segments is computationally inexpensive [17]. (2) We include time-order information to encode the temporal sequence of the complete video in a single feature vector. (3) The proposed action representation built from short-segment features is robust to view changes and works well with variable-background sequences.

Section snippets

Proposed T-VLAD for multiview action recognition

This section describes the details of the proposed T-VLAD encoding approach for view-invariant action representation, as illustrated in Fig. 1. C3D features are computed over fixed sampling intervals of video to form a feature descriptor Xf. An additional time-order descriptor Xt is formed to indicate the order of segments. Xf and Xt are then used to learn codebooks Cf and Ct, respectively. Finally, assignments from Ct are used to pool the residuals Rf, and the membership information Sf is evaluated…
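
A minimal sketch of this pipeline is given below, assuming k-means codebooks and hard assignments; the codebook sizes, feature dimension and the exact way Rf is pooled with Sf are illustrative assumptions rather than the paper's precise formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_codebooks(train_feats, train_orders, kf=256, kt=8):
    """Learn the feature codebook Cf and the smaller time-order codebook Ct
    from training data (codebook sizes are illustrative)."""
    Cf = KMeans(n_clusters=kf, n_init=4).fit(train_feats).cluster_centers_
    Ct = KMeans(n_clusters=kt, n_init=4).fit(train_orders).cluster_centers_
    return Cf, Ct

def t_vlad(Xf, Xt, Cf, Ct):
    """Illustrative T-VLAD-style encoding of one video.
    Xf: (n_segments, d) C3D features of the non-overlapping segments.
    Xt: (n_segments, 1) time-order descriptor, e.g. normalised segment index."""
    # Residuals Rf of each segment feature w.r.t. its nearest feature centre in Cf.
    f_assign = np.argmin(np.linalg.norm(Xf[:, None] - Cf[None], axis=2), axis=1)
    Rf = Xf - Cf[f_assign]

    # Membership Sf: hard assignment of each segment to a time-order centre in Ct.
    t_assign = np.argmin(np.abs(Xt - Ct.T), axis=1)

    # Pool residuals per time-order centre, so the encoding length is a
    # multiple of the small time-order codebook size len(Ct), not len(Cf).
    enc = np.zeros((Ct.shape[0], Xf.shape[1]))
    for k in range(Ct.shape[0]):
        if np.any(t_assign == k):
            enc[k] = Rf[t_assign == k].sum(axis=0)
    enc = enc.flatten()
    return enc / (np.linalg.norm(enc) + 1e-12)   # L2-normalised video descriptor
```

The resulting vector length is Xf.shape[1] x Ct.shape[0], i.e. a multiple of the small time-order codebook size, which is the property highlighted in the abstract.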

Datasets

The proposed approach has been evaluated on three fixed-camera-setup multi-view human action datasets: IXMAS, MuHAVi and MCAD. IXMAS includes 1,650 videos from 5 camera views of 10 actors performing 11 actions. MuHAVi contains 3,443 videos from 8 different views of 7 actors performing 17 actions. MCAD is more complex, as it contains 14,298 videos of 20 actors performing 18 actions, recorded with 3 static cameras and 2 PTZ cameras. In this study, only the static camera videos are used for evaluation.

Conclusion

This study has proposed a novel modification of VLAD to capture the temporal sequence of a complete action. T-VLAD encodes the global features of non-overlapping video segments using their time-order information. Experiments on MuHAVi, IXMAS, MCAD and UCF101 demonstrate that T-VLAD is an efficient view-invariant representation of the long-term temporal sequence of a complete action that also performs equally well on dynamic backgrounds. For densely sampled video, a high sampling rate increases the number of…

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

Muhammad Haroon Yousaf has received funding from Higher Education Commission, Pakistan, for Swarm Robotics Lab under National Centre for Robotics and Automation (NCRA).

References (38)

  • C.-Y. Ma et al., TS-LSTM and temporal-inception: exploiting spatiotemporal dynamics for activity recognition, Signal Process. Image Commun. (2019)
  • X. Peng et al., Bag of visual words and fusion methods for action recognition: comprehensive study and good practice, Comput. Vis. Image Underst. (2016)
  • G. Yao et al., A review of convolutional-neural-network-based action recognition, Pattern Recognit. Lett. (2019)
  • R. Arandjelovic et al., NetVLAD: CNN architecture for weakly supervised place recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • D.R. Beddiar et al., Vision-based human activity recognition: a survey, Multimed. Tools Appl. (2020)
  • V.A. Chenarlogh et al., A multi-view human action recognition system in limited data case using multi-stream CNN, 2019 5th Iranian Conference on Signal Processing and Intelligent Systems (ICSPIS) (2019)
  • K.-P. Chou et al., Robust feature-based automated multi-view human action recognition system, IEEE Access (2018)
  • A. Diba et al., Temporal 3D convnets using temporal transition layer, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2018)
  • I.C. Duta et al., Spatio-temporal VLAD encoding for human action recognition in videos, International Conference on Multimedia Modeling (2017)
  • C. Feichtenhofer, X3D: expanding architectures for efficient video recognition, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)
  • C. Feichtenhofer et al., Convolutional two-stream network fusion for video action recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • H. Gammulle et al., Two stream LSTM: a deep fusion framework for human action recognition, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) (2017)
  • R. Girdhar et al., ActionVLAD: learning spatio-temporal aggregation for action classification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • A. Iosifidis et al., Minimum variance extreme learning machine for human action recognition, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014)
  • M.E. Kalfaoglu et al., Late temporal modeling in 3D CNN architectures with BERT for action recognition
  • A. Karpathy et al., Large-scale video classification with convolutional neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
  • W. Li et al., Multi-camera action dataset for cross-camera action recognition benchmarking, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) (2017)
  • Y. Li et al., TEA: temporal excitation and aggregation for action recognition, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)
  • J. Lin et al., TSM: temporal shift module for efficient video understanding, Proceedings of the IEEE/CVF International Conference on Computer Vision (2019)