Pattern Recognition

Volume 104, August 2020, 107353

Timed-image based deep learning for action recognition in video sequences

https://doi.org/10.1016/j.patcog.2020.107353

Highlights

  • Image data conditioning issue: the paper first highlights that mapping 2D spatial convolution to its 1D Hilbert based instance is highly accurate in terms of information compressibility upon the image frames associated with a wide class of video files.

  • Video library conditioning issue: because of the above compressibility, the paper proposes converting the 2D + X data volume into a single meta-image file format, called the timed-image, prior to applying machine learning frameworks. This conversion is such that any 2D frame of the 2D + X data is reshaped as a 1D array indexed by a Hilbert space-filling curve, while the third variable X of the initial file format becomes the second variable in the meta-image format.

  • Sensitive action recognition benchmark: the paper provides two datasets having 2 and 3 violence video categories respectively. The datasets involve visually non-violent, moderately violent and extremely violent actions.

  • Sensitive action recognition issue: outstanding 2-level and 3-level violence classification results are obtained from deep convolutional neural networks trained from scratch and operating on meta-image databases.

Abstract

The paper addresses two issues relative to machine learning on 2D + X data volumes, where 2D refers to image observation and X denotes a variable that can be associated with time, depth, wavelength, etc. The first issue addressed is the conditioning of these structured volumes for compatibility with convolutional neural networks operating on 2D image file formats. The second issue is associated with sensitive action detection in the “2D + Time” case (video clips and image time series). For the data conditioning issue, the paper first highlights that mapping 2D spatial convolution to its 1D Hilbert based instance is highly accurate in terms of information compressibility upon tight frames of convolutional networks. As a consequence of this compressibility, the paper proposes converting the 2D + X data volume into a single meta-image file format prior to applying machine learning frameworks. This conversion is such that any 2D frame of the 2D + X data is reshaped as a 1D array indexed by a Hilbert space-filling curve, while the third variable X of the initial file format becomes the second variable in the meta-image format. For the sensitive action recognition issue, the paper provides: (i) a 3-category video database involving non-violent, moderate and extreme violence actions; (ii) the conversion of this database into a timed meta-image database through the 2D + Time to 2D conditioning stage described above and (iii) outstanding 2-level and 3-level violence classification results from deep convolutional neural networks operating on meta-image databases.

Introduction

Convolutional Neural Networks (CNN) have demonstrated outstanding performance in recent image processing engines. Many frameworks, specifically designed and optimized for the image file format (see [1, MatConvNet] for instance), have helped fill the semantic gap between raw images and the high-level objects that can be recognized from their contents.

When considering a video 2D + t or a stereoscopic 2D + d file format, CNN based feature extraction requires:

  • [Option-1] either adapting the network configurations according to dimension extents (by considering dimension extension for network parameters, by separating spatial feature analysis and temporal/depth information processing, etc.);

  • [Option-2] or relating the 2D + X data to a 2D meta-image format in order to directly use the above frameworks (already optimized for image analysis).

The literature has mainly addressed [Option-1], along a wide range of observable directions. The first direction involves the computational technicality standpoint relative to 2D + X feature extraction architectures: for instance, CPU and GPU based architectures have been proposed recently in Sun [2] for 3D convolutions and max-pooling operations that are consistent with MatConvNet. In addition, Lin et al. [3] have proposed alternative shift and merge modules for spatio-temporal information aggregation, in order to reduce the computational complexity of 3D convolution; Yang et al. [4] have proposed asymmetric one-directional 3D convolutions, whereas [5] has preferred deformable 3D convolutions.

The second direction concerns two-level/stream architectures operating respectively on spatial and temporal (optical flow) features for learning actions in 2D + t video datasets: for instance, Simonyan and Zisserman [6] have proposed two independent CNNs for learning both still frames and motion between frames, and [7] has considered a refinement of [6] in terms of spatial and temporal information integration at several levels of granularity. Another solution, proposed in Carreira and Zisserman [8], is also a two-stream 2D + t architecture where images and their optical flows are processed separately by using convenient convolution operators prior to late fusion. Still in the same direction, Zhang and Hu [9] have proposed an adaptation of the two-stream framework for long-range video representation by using multiple local features, whereas [10] has proposed an extension of the two-stream framework from a multiscale perspective.

Some alternative directions can be found in:

  • Chen et al. [11], in terms of learning a projection matrix associated with principal orientations, or Wang et al. [12], from a series of one-dimensional temporal convolution operations;

  • Ullah et al. [13], which adopts a bidirectional Long Short-Term Memory (LSTM) framework for a recurrent feature description strategy, with the constraint of selecting specific video frames since, otherwise, dimensionality leads to non-tractable algorithms on limited computational resources;

  • Fang et al. [14], where graph parsing neural network architectures are developed, and [15], where ontology-like grammars can be used to disambiguate certain specific situations.

In terms of action recognition benchmarks, most of the above references have highlighted the difficulty of identifying generalizable 2D + X architectures, and the most relevant strategy among the directions given above remains an open issue at present.

On the one hand, the limitation affecting 2D + X frameworks on large data volumes is the intricacy of nD convolution kernel updating strategies with respect to the capture of tiny objects/events in huge data when n is large. For such huge data, and due to the above computational limitation, robust network design is challenging and training currently requires assistance: for instance, in Yoon et al. [16], only 2D spatial directional convolution operations are used for a first training stage, and some of the weights obtained are selected to guide the 3D MatConvNet on a patch-by-patch basis. The same holds true for the two-stream spatial and temporal strategy given in Carreira and Zisserman [8] and Feichtenhofer et al. [7]: the approaches are deliberately separable (handcrafted extraction of spatial and optical flow features, whereas a single ‘intelligent’ 3D network should have been able to perform this extraction if exploration of the intrinsic 3D feature space had been straightforward). Another solution to limit computational complexity is the use of compressed domain video representations, as in Chadha et al. [17]. However, the results obtained with this approach are slightly weaker than those obtained by the two-stream fusion stages used in Carreira and Zisserman [8] for recognition of homogeneous actions on the same databases. Thus, compression can limit performance depending on its rate.

On the other hand, the major hardware issue when handling huge 2D + X datasets is the limited random-access memory available on standard computer architectures. This currently limits training capabilities, since convergence to a desirable solution cannot be guaranteed when using tiny loads in the optimization batches.

It is worth mentioning that [Option-2] can be achieved by compacting the spatial dimensions into a 1D format, thus converting 2D + t video data into a 2D meta-image for instance. But not all 2D-to-1D transforms guarantee nice properties for capturing the dependencies that are intrinsic to spatial image features. In order to perform [Option-2] while best compacting image spatial dependencies, the paper proposes the Hilbert space-filling 1D image description.
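As an illustration of this conditioning step, the following Python sketch converts a 2D + t clip into a timed meta-image: each frame is unrolled along a Hilbert curve and the frame index becomes the second axis. The index mapping uses the classic iterative bit-manipulation construction of the curve, which is one standard way of computing it and is not claimed to be the paper's own implementation; array names and sizes are illustrative assumptions.

```python
import numpy as np

def xy2d(n, x, y):
    """Hilbert index of pixel (x, y) on an n x n grid, n a power of 2
    (classic iterative construction, one of several equivalent ones)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate the sub-quadrant so the curve stays continuous
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

def to_timed_image(clip):
    """clip: (T, n, n) array of frames -> (n * n, T) timed meta-image,
    with rows indexed by Hilbert position and columns indexed by time."""
    T, n, _ = clip.shape
    # hilbert[y * n + x] = curve position of the raster-scan pixel (x, y)
    hilbert = np.array([xy2d(n, x, y) for y in range(n) for x in range(n)])
    order = np.argsort(hilbert)      # raster indices sorted by curve position
    flat = clip.reshape(T, n * n)    # raster-scan each frame
    return flat[:, order].T          # (n * n, T)

clip = np.random.rand(64, 32, 32)    # 64 frames of 32 x 32 pixels
meta = to_timed_image(clip)
print(meta.shape)                    # (1024, 64)
```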

The first set of contributions concerns the analysis of Hilbert space-filling curves with respect to a concise and compressible spatial feature representation under convolution operators. This analysis is performed in terms of: (a) the maximal spatial shifts loaded in regard to the length of the convolution filter and (b) the sparsity degrees of convolution operators when the convolution is performed in the 1D Hilbert domain; see Sections 2 and 3 respectively.

The second set of contributions, provided in Section 4 as a valuable application of the former contribution, concerns a solution to the challenging issue of heterogeneous action recognition in 2D + t data. In contrast with the homogeneous action recognition issue, where any category is composed of approximately the same types of motion (for example ‘running’, ‘smiling’, etc., handled among others in Carreira and Zisserman [8] thanks to homogeneous motion databases), the heterogeneous case of violence interpretation (several types of actions having the same consequence: a feeling of violence) is very intricate and somewhat subjective. We will present a state of the art on violence detection in Section 4 and address violence action recognition on the basis of: (i) violence data benchmarking and (ii) 2D CNNs performing heterogeneous action learning on Hilbert based timed meta-image datasets.

To summarize, the major contributions provided through [Option-2] aim at:

  • exploiting image spatial compressibility in order to reduce memory load issues,

  • learning both spatial and temporal features jointly,

  • deriving a framework that makes the use of well-known 2D image based frameworks straightforward.

The main processing steps associated with the paper and developed below are described in the block diagram given by Fig. 1.
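To make the last contribution concrete, here is a minimal sketch of a standard 2D CNN consuming timed meta-images for 3-level violence classification. This is an assumption-level illustration in PyTorch, not the architecture trained in the paper; layer widths and input sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Minimal 2D CNN over (1, n*n, T) timed meta-images; illustrative only.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # global pooling accepts any (n*n, T) extent
    nn.Flatten(),
    nn.Linear(32, 3),          # 3 classes: non-violent / moderate / extreme
)

x = torch.randn(4, 1, 1024, 64)  # 4 meta-images: 64 frames of 32 x 32 pixels
print(model(x).shape)            # torch.Size([4, 3])
```

The payoff of the conversion is visible here: a stock 2D pipeline handles the temporal dimension with no architectural change.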

Section snippets

Hilbert space-filling curves: spatial data loads with respect to convolution size

Throughout, G(M) always refers to a square grid indexing 2^M × 2^M pixels, with its upper-left corner at (0,0). This indexing is for convenience: Hilbert space-filling curves can also be computed on non-square grids. For the adaptation of Hilbert space-filling constructions to the image domain, see [18] (resampling operators) and [19] (lossless compression), among other references.
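As a complement (again an assumption-level sketch rather than the paper's code), the inverse mapping from a curve position d to a pixel (x, y) on G(M) can be computed iteratively, and the defining locality property of the curve, namely that consecutive Hilbert indices always address 4-neighbour pixels, can be checked directly:

```python
def d2xy(n, d):
    """Pixel (x, y) at Hilbert index d on an n x n grid, n a power of 2."""
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

# Locality check on G(M) with M = 5: successive curve points are 4-neighbours.
n = 2 ** 5
pts = [d2xy(n, d) for d in range(n * n)]
assert all(abs(x1 - x0) + abs(y1 - y0) == 1
           for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```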

Non-stationarity of Hilbert domain convolution outputs

In this section, we are interested in the statistical properties of the convolution output $J^*$ given by Eq. (3). One can first note that if the input image $I$ has constant mean, then the same holds true for $J^*$ since, for any $\ell$,

$$\mathbb{E}[J^*(\ell)] = \sum_k h^*(k)\,\mathbb{E}\big[I(U(\ell-k), V(\ell-k))\big] = \mathbb{E}[I] \times \sum_k h^*(k).$$

Thus, we can assume without loss of generality that $I$ is zero-mean in the following.
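This mean identity can be sanity-checked numerically; the sketch below assumes an i.i.d. constant-mean input along an arbitrary fixed scan (the identity does not depend on which scan is used) and an arbitrary filter:

```python
import numpy as np

rng = np.random.default_rng(0)
I_scan = rng.normal(loc=5.0, scale=1.0, size=4096)  # scanned frame, constant mean 5
h = rng.normal(size=7)                              # arbitrary filter h*

J = np.convolve(I_scan, h, mode="valid")  # J*(l) = sum_k h*(k) I_scan(l - k)
print(J.mean())                           # both close to E[I] * sum_k h*(k)
print(I_scan.mean() * h.sum())
```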

The autocorrelation function of $J^*$ is

$$\mathbb{E}[J^*(\ell)\,J^*(\ell+\tau)] = \sum_{k,\,k'} h^*(k)\,h^*(k')\,\mathbb{E}\big[I(U(\ell-k), V(\ell-k)) \times I(U(\ell+\tau-k'), V(\ell+\tau-k'))\big]$$

If $I$ is stationary, then

Homogeneous versus heterogeneous action recognition

In terms of action categorization, homogeneity means that the category elements share the same spatio-temporal flow property: for instance, the action of applying eye makeup involves roughly the same movements in UCF-101: from the bottle to the eyelashes and eyebrows. There is no other way to apply this mascara (at least in these examples) and this makes the action more predictable: eyelashes/eyebrows, a bottle of mascara and an arm movement are enough to determine the action with great

Conclusion and prospects

This work has addressed intrinsic 3D feature learning from a Hilbert based meta-image description of 3D data. The 3D = 2D + X description has been designed on the basis of the duality between spatial 2D observations and the additional dimension X, which can relate to time, wavelength or depth information. The aim of this description was to obtain a good balance between spatial information (compacted in one dimension) and the additional information provided by variations in X.

More specifically, we have

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The work was supported by grant SAINS of the Université Savoie Mont Blanc, France. Numerical simulations were performed thanks to the facilities offered by the MUST computing center of Université Savoie Mont Blanc and the workstations of the French National Research Agency project PHOENIX ANR-15-CE23-0012. The violence datasets post-processed and analyzed were supplied by the MEDIAEVAL and TECHNICOLOR initiatives. The authors are grateful to the USMB/LISTIC members who participated in the video clip annotations.


References (40)

  • J. Lin et al., TSM: temporal shift module for efficient video understanding, IEEE International Conference on Computer Vision (ICCV), 2019.
  • H. Yang et al., Asymmetric 3D convolutional neural networks for action recognition, Pattern Recognit., 2019.
  • C. Feichtenhofer et al., Convolutional two-stream network fusion for video action recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • J. Carreira et al., Quo vadis, action recognition? A new model and the kinetics dataset, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • J. Zhang et al., Domain learning joint with semantic adaptation for human action recognition, Pattern Recognit., 2019.
  • H.-S. Fang et al., Learning pose grammar to encode human body configuration for 3D pose estimation, AAAI Conference on Artificial Intelligence, 2018.
  • S. Qi et al., Learning human-object interactions by graph parsing neural networks, European Conference on Computer Vision (ECCV), 2018.
  • Y.-G. Yoon et al., Feasibility of 3D reconstruction of neural morphology using expansion microscopy and barcode-guided agglomeration, Front. Comput. Neurosci., 2017.
  • A. Chadha et al., Compressed-domain video classification with deep neural networks: “there’s way too much information to decode the matrix”, IEEE International Conference on Image Processing (ICIP), 2017.
  • J. Valantinas, On the use of space-filling curves in changing image dimensionality, Inf. Technol. Control, 2005.

Abdourrahmane Mahamane Atto received the Ph.D. degree in mathematics and applications, co-delivered by the University of Rennes I and TELECOM Bretagne, France (2008), and the Habilitation degree for research supervision from the University Grenoble Alpes, France (2015). Since September 2011, he has been an associate professor at the University Savoie Mont Blanc, Polytech Annecy-Chambéry, LISTIC, France. His research interests concern mathematical methods and models for image time series and information processing.

Alexandre Benoit received the Ph.D. degree in electronics and computer science from the University of Grenoble, INP, in 2007. Since 2008, he has been an associate professor at Université Savoie Mont Blanc, at the LISTIC lab. His main research interest is the understanding of still images and temporal image sequences. He develops deep learning models for remote sensing and multimedia applications such as image classification, pixel-level semantic segmentation and regression problems for astrophysics. He develops specific approaches adapted to the sensor and data, such as a retina model now distributed in the OpenCV library, and specific deep neural networks and operators for hyperspectral image analysis and astrophysics.

Patrick Lambert received the Ph.D. degree in signal processing from the National Polytechnic Institute of Grenoble, Grenoble, France, in 1983. He is currently a Full Professor with the School of Engineering, Université Savoie Mont Blanc, Annecy, France, and a member of the Informatics, Systems, Information and Knowledge Processing Laboratory (LISTIC), Annecy, France. His research interests include image and video analysis, currently dedicated to non-linear color filtering and automatic image understanding.
