Pattern Recognition Letters

Volume 148, August 2021, Pages 22-28

T-VLAD: Temporal vector of locally aggregated descriptor for multiview human action recognition

https://doi.org/10.1016/j.patrec.2021.04.023

Highlights

  • We encode the long-term temporal structure of actions using single-stream C3D features from short segments of video.

  • We include time-order information when encoding the temporal sequence of the complete action video, hence the name temporal VLAD (T-VLAD).

  • T-VLAD timestamps the primitive motions of an action in short video segments, facilitating view-invariant action recognition.

  • State-of-the-art results are shown on the fixed-setup multiview datasets MuHAVi, IXMAS and MCAD.

  • The proposed encoding scheme T-VLAD performs equally well on a dynamic background dataset, UCF101.

Abstract

Robust view-invariant human action recognition (HAR) requires an effective representation of the temporal structure of actions in multi-view videos. This study explores a view-invariant action representation based on convolutional features. Action representation over long video segments is computationally expensive, whereas features from short video segments limit the temporal coverage locally. Previous methods are based on complex multi-stream deep convolutional feature maps extracted over short segments. To cope with this issue, a novel framework is proposed based on a temporal vector of locally aggregated descriptors (T-VLAD). T-VLAD encodes the long-term temporal structure of the video employing single-stream convolutional features over short segments. A standard VLAD vector size is a multiple of its feature codebook size (256 is normally recommended). VLAD is modified to incorporate time-order information of segments, so that the T-VLAD vector size is a multiple of its smaller time-order codebook size. Previous methods have not been extensively validated for view variation. Here, results are validated in a challenging setup, where one view is used for testing and the remaining views are used for training. State-of-the-art results have been obtained on three multi-view datasets with fixed cameras, IXMAS, MuHAVi and MCAD. The proposed encoding approach T-VLAD also works equally well on a dynamic background dataset, UCF101.
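
To make the size relationship concrete, here is a rough numeric sketch; the feature dimension and time-order codebook size below are assumed for illustration and are not values reported in the paper.

```python
# Rough dimensionality sketch with assumed values (not the paper's exact settings).
D = 4096        # assumed per-segment C3D feature dimension
K_f = 256       # typical VLAD feature-codebook size, as noted in the abstract
K_t = 32        # hypothetical, smaller time-order codebook size

vlad_size = D * K_f     # standard VLAD: one D-dim residual block per feature centre
tvlad_size = D * K_t    # T-VLAD: one D-dim pooled residual block per time-order centre

print(vlad_size, tvlad_size)   # 1048576 vs. 131072 for these assumed sizes
```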

Introduction

A robust view-invariant HAR framework requires an effective multi-view video representation. Influential multi-view methods [22], [37] suggest transferring view knowledge from one view to a high-level virtual view. However, these methods require view-specific information. Recent CNN-based (Convolutional Neural Network) HAR methods have shown significant progress in recognizing actions from single views. However, according to the review in [35], these methods are unable to overcome the challenge of view-point variation. There is thus a need to explore simpler CNN-feature methods that can handle view variation.

Different from hand-crafted methods for action representation [20], [30], CNN-based methods learn trainable action representations directly from video. CNNs were originally intended to extract 2D or spatial features, not the 3D spatio-temporal characteristics of video. Hence, a major interest is in how to incorporate temporal information in CNNs. An important and popular method that encodes temporal information takes stacked optical flow along with RGB frames as input in a two-stream model [23]. In [8], the two-stream model was extended by the addition of a novel spatial and temporal fusion layer using 3D Conv and 3D pooling. Tran et al. [26] factorized the 3D Conv into a separate 2D spatial Conv and a 1D temporal Conv, which are easier to optimize. In [15], the temporal excitation and aggregation (TEA) block introduced a multiple temporal aggregation (MTA) module that replaces the 1D temporal Conv with a group of sub-convolutions. Gammulle et al. [9] exploited the memory cells of LSTMs to learn the long-range temporal structure of video. However, TS-LSTM [17] suggests the use of pre-segmented data to fully exploit temporal information.
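
As a rough sketch of the factorization described for [26], the following PyTorch-style snippet replaces a single t x k x k 3D convolution with a 2D spatial convolution followed by a 1D temporal convolution; the layer widths and clip size are illustrative assumptions, not the architecture of [26].

```python
import torch
import torch.nn as nn

class Factorised3DConv(nn.Module):
    """Sketch: split a t x k x k 3D convolution into a spatial 1 x k x k
    convolution followed by a temporal t x 1 x 1 convolution."""
    def __init__(self, in_ch, mid_ch, out_ch, k=3, t=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0))

    def forward(self, x):               # x: (batch, channels, time, height, width)
        return self.temporal(torch.relu(self.spatial(x)))

clip = torch.randn(2, 3, 16, 112, 112)  # an assumed 16-frame clip at 112 x 112
out = Factorised3DConv(3, 45, 64)(clip)
print(out.shape)                        # torch.Size([2, 64, 16, 112, 112])
```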

In [13], [36], the authors investigated approaches to fuse and aggregate temporal frame-level information into video-level information. Varol et al. [29] apply Long-term Temporal Convolutions (LTC) on a large number of frames to capture the long-range temporal structure of video, whereas [32] introduces a non-local neural network for this purpose. To capture visual content over the entire video, the Temporal Relation Network (TRN) [38] exploits temporal relational reasoning, and the Temporal Segment Network (TSN) [31] computes features from randomly selected snippets of video segments. ActionS-ST-VLAD [27] aggregates spatio-temporal local CNN features over the entire video, ignoring time-order information. SeqVLAD [33], unlike NetVLAD [1] and ActionVLAD [10], encodes the temporal relationships of sequential frames, combining a trainable VLAD and a Recurrent Convolutional Network (RCN). [24] proposes a Temporal-Spatial Mapping (TSM) function that maps time-order information and CNN features of densely sampled frames into a 2D feature map, referred to as a VideoMap. This dense sampling increases the computational cost of the CNN, and the fixed sampling rate limits temporal coverage locally.

Most of the approaches [27], [29], [31], [33] proposed to model the long-range temporal structure of actions are limited to two-stream CNN features. These two-stream models require computationally expensive optical flow (or other motion vectors) as additional input, and the training of two deep networks. A single-stream 3D convolutional network (C3D) [25] replaces the 2D Conv kernel with a 3D Conv kernel, representing actions at short-segment level as generic, efficient and compact spatio-temporal features. These features are simply aggregated, thus ignoring the temporal relationship of sequential short segments. Bidirectional encoder representations from transformers (BERT) [12] adds a layer at the end of the 3D CNN that incorporates temporal information with an attention mechanism. The temporal pyramid network (TPN) [34] forms a feature hierarchy that captures visual characteristics at various temporal scales of an action. The temporal shift module (TSM) [16], inserted in a 2D CNN, shifts part of the channels along the temporal dimension to facilitate information exchange among neighbouring frames. X3D [7] expands a tiny spatial network along temporal duration, frame rate, spatial resolution, width, and depth to achieve an efficient video network with a target complexity. The temporal 3D convolutional network (T3D) [5] extends the 2D DenseNet architecture, adding a 3D temporal transition layer (TTL) which operates on different temporal depths, capturing short-, mid-, and long-range temporal information. Although the computation of 3D Conv burdens GPU memory, it does not involve computing the correlation between appearance and motion as in two-stream models.
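
To illustrate the channel-shift idea attributed to TSM [16], a minimal sketch is given below; the tensor layout and shift fraction are assumptions for illustration rather than the exact module of [16].

```python
import torch

def temporal_shift(x, fold_div=8):
    """Shift part of the channels along time: the first 1/fold_div of channels
    move one step forward, the next 1/fold_div move one step backward, and the
    remaining channels stay in place. x: (batch, time, channels, height, width)."""
    b, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # borrow from the next frame
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # borrow from the previous frame
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # untouched channels
    return out

clip = torch.randn(2, 8, 64, 56, 56)     # assumed 8-frame feature clip
print(temporal_shift(clip).shape)        # torch.Size([2, 8, 64, 56, 56])
```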

This work proposes a novel multi-view framework to capture the temporal sequence of a complete video, exploiting the time order of non-overlapping video segments. The spatio-temporal features of video segments are pooled employing the second-order statistics of VLAD, based on action-class membership and segment time-order membership. Forming a sequential VLAD from a codebook is a simpler approach compared to the end-to-end trainable networks SeqVLAD [33] and T3D [5], which capture temporal information through the additional top layers SGRU-RCN and TTL, respectively. These layers increase network training time, due to backpropagation through the RCN and the additional parameters for convolution at multiple temporal depths in the TTL.

Motivation: The proposed method is inspired by Spatio-Temporal VLAD (ST-VLAD) [6], which incorporates the spatial location of each frame's local features; T-VLAD instead includes the temporal location of each video segment's global features, hence extending frame-level information to video-level information. Moreover, human actions are a combination of several common primitive motions and a discriminating motion. VLAD recognizes an action based on its discriminating motion. The common primitive motions occur in a specific time order, and the discriminating motion may become occluded due to view variations. T-VLAD timestamps these primitive motions, facilitating view-invariant action recognition. This is illustrated in Fig. 5 and explained in detail in Section 3.6 with an example.

It is worth highlighting the following contributions: (1) We encode the long-term temporal structure of actions using single-stream C3D features from sparsely sampled short segments of video; computing features from short video segments is computationally inexpensive [17]. (2) We include time-order information to encode the temporal sequence of the complete video in a single feature vector. (3) The proposed action representation built from short-segment features is robust to view changes and works well with variable-background sequences.

Section snippets

Proposed T-VLAD for multiview action recognition

This section describes the details of the proposed T-VLAD encoding approach for view-invariant action representation, as illustrated in Fig. 1. C3D features are computed over fixed sampling intervals of video to form a feature descriptor Xf. An additional time-order descriptor Xt is formed to indicate the order of segments. Xf and Xt are then used to learn codebooks Cf and Ct, respectively. Finally, assignments from Ct are used to pool the residuals Rf, and the membership information Sf is evaluated…
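
A minimal sketch of this pipeline is given below, assuming k-means codebooks and hard assignments; the codebook sizes, feature dimension and the exact way Rf is pooled with Sf are illustrative assumptions rather than the paper's precise formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_codebooks(train_feats, train_orders, kf=256, kt=8):
    """Learn the feature codebook Cf and the smaller time-order codebook Ct
    from training data (codebook sizes are illustrative)."""
    Cf = KMeans(n_clusters=kf, n_init=4).fit(train_feats).cluster_centers_
    Ct = KMeans(n_clusters=kt, n_init=4).fit(train_orders).cluster_centers_
    return Cf, Ct

def t_vlad(Xf, Xt, Cf, Ct):
    """Illustrative T-VLAD-style encoding of one video.
    Xf: (n_segments, d) C3D features of the non-overlapping segments.
    Xt: (n_segments, 1) time-order descriptor, e.g. normalised segment index."""
    # Residuals Rf of each segment feature w.r.t. its nearest feature centre in Cf.
    f_assign = np.argmin(np.linalg.norm(Xf[:, None] - Cf[None], axis=2), axis=1)
    Rf = Xf - Cf[f_assign]

    # Membership Sf: hard assignment of each segment to a time-order centre in Ct.
    t_assign = np.argmin(np.abs(Xt - Ct.T), axis=1)

    # Pool residuals per time-order centre, so the encoding length is a
    # multiple of the small time-order codebook size len(Ct), not len(Cf).
    enc = np.zeros((Ct.shape[0], Xf.shape[1]))
    for k in range(Ct.shape[0]):
        if np.any(t_assign == k):
            enc[k] = Rf[t_assign == k].sum(axis=0)
    enc = enc.flatten()
    return enc / (np.linalg.norm(enc) + 1e-12)   # L2-normalised video descriptor
```

The resulting vector length is Xf.shape[1] x Ct.shape[0], i.e. a multiple of the small time-order codebook size, which is the property highlighted in the abstract.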

Datasets

The proposed approach has been evaluated on three fixed-camera-setup multi-view human action datasets: IXMAS, MuHAVi and MCAD. IXMAS includes 1,650 videos from 5 camera views of 10 actors performing 11 actions. MuHAVi contains 3,443 videos from 8 different views of 7 actors performing 17 actions. MCAD is more complex, as it contains 14,298 videos of 20 actors performing 18 actions, recorded with 3 static cameras and 2 PTZ cameras. In this study, only the static camera videos are used for evaluation.

Conclusion

This study has proposed a novel modification of VLAD to capture the temporal sequence of a complete action. T-VLAD encodes the global features of non-overlapping video segments using their time-order information. Experiments on MuHAVi, IXMAS, MCAD and UCF101 demonstrate that T-VLAD is an efficient view-invariant representation of the long-term temporal sequence of a complete action that also performs equally well on dynamic backgrounds. For densely sampled video, a high sampling rate increases the number of…

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

Muhammad Haroon Yousaf has received funding from Higher Education Commission, Pakistan, for Swarm Robotics Lab under National Centre for Robotics and Automation (NCRA).

References (38)

  • C.-Y. Ma et al., TS-LSTM and temporal-inception: exploiting spatiotemporal dynamics for activity recognition, Signal Process. Image Commun. (2019)
  • X. Peng et al., Bag of visual words and fusion methods for action recognition: comprehensive study and good practice, Comput. Vis. Image Underst. (2016)
  • G. Yao et al., A review of convolutional-neural-network-based action recognition, Pattern Recognit. Lett. (2019)
  • R. Arandjelovic et al., NetVLAD: CNN architecture for weakly supervised place recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • D.R. Beddiar et al., Vision-based human activity recognition: a survey, Multimed. Tools Appl. (2020)
  • V.A. Chenarlogh et al., A multi-view human action recognition system in limited data case using multi-stream CNN, 2019 5th Iranian Conference on Signal Processing and Intelligent Systems (ICSPIS) (2019)
  • K.-P. Chou et al., Robust feature-based automated multi-view human action recognition system, IEEE Access (2018)
  • A. Diba et al., Temporal 3D convnets using temporal transition layer, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2018)
  • I.C. Duta et al., Spatio-temporal VLAD encoding for human action recognition in videos, International Conference on Multimedia Modeling (2017)
  • C. Feichtenhofer, X3D: expanding architectures for efficient video recognition, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)
  • C. Feichtenhofer et al., Convolutional two-stream network fusion for video action recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • H. Gammulle et al., Two stream LSTM: a deep fusion framework for human action recognition, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) (2017)
  • R. Girdhar et al., ActionVLAD: learning spatio-temporal aggregation for action classification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • A. Iosifidis et al., Minimum variance extreme learning machine for human action recognition, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014)
  • M.E. Kalfaoglu et al., Late temporal modeling in 3D CNN architectures with BERT for action recognition
  • A. Karpathy et al., Large-scale video classification with convolutional neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
  • W. Li et al., Multi-camera action dataset for cross-camera action recognition benchmarking, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) (2017)
  • Y. Li et al., TEA: temporal excitation and aggregation for action recognition, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)
  • J. Lin et al., TSM: temporal shift module for efficient video understanding, Proceedings of the IEEE/CVF International Conference on Computer Vision (2019)