Timed-image based deep learning for action recognition in video sequences
Introduction
Convolutional Neural Networks (CNNs) have demonstrated outstanding performance in recent image processing engines. Many frameworks specifically designed and optimized for the image file format, see [1, MatConvNet] for instance, have helped fill the semantic gap between raw images and the high-level objects that can be recognized from their contents.
When considering a 2D + t video or a 2D + d stereoscopic file format, CNN-based feature extraction requires:
- •
[Option-1] either adapting the network configuration to the dimension extents (by considering dimension extension for the network parameters, by separating spatial feature analysis from temporal/depth information processing, etc.);
- •
[Option-2] or mapping the 2D + X data to a 2D meta-image format in order to directly use the above frameworks (already optimized for image analysis).
The literature has mainly addressed [Option-1], with a wide range of observable directions. The first direction concerns computational technicalities of 2D + X feature extraction architectures: for instance, CPU- and GPU-based implementations have recently been proposed in Sun [2] for 3D convolution and max-pooling operations consistent with MatConvNet. In addition, Lin et al. [3] have proposed alternative shift-and-merge modules for spatio-temporal information aggregation, in order to reduce the computational complexity of 3D convolutions; Yang et al. [4] have proposed asymmetric one-directional 3D convolutions, whereas [5] has preferred deformable 3D convolutions.
The second direction concerns two-level/two-stream architectures operating respectively on spatial and temporal (optical flow) features for learning actions in 2D + t video datasets: for instance, Simonyan and Zisserman [6] have proposed two independent CNNs for learning both still frames and motion between frames, and [7] has considered a refinement of [6] in terms of spatial and temporal information integration at several levels of granularity. Another solution, proposed in Carreira and Zisserman [8], is likewise a two-stream 2D + t architecture in which images and their optical flows are processed separately, using convenient convolution operators, prior to late fusion. Still in the same direction, Zhang and Hu [9] have proposed an adaptation of the two-stream framework to long-range video representation using multiple local features, whereas [10] has extended the two-stream framework to a multiscale perspective.
Some alternative directions can be found in:
- •
Chen et al. [11], in terms of learning a projection matrix associated with principal orientations, or Wang et al. [12], from a series of one-dimensional temporal convolution operations;
- •
Ullah et al. [13], which adopts a bidirectional Long Short-Term Memory (LSTM) framework for a recurrent feature description strategy, with the constraint of selecting specific video frames since, otherwise, dimensionality leads to intractable algorithms on limited computational resources;
- •
Fang et al. [14], where graph parsing neural network architectures are developed, and [15], where ontology-like grammars can be used to disambiguate certain specific situations.
In terms of action recognition benchmarks, most of the above references have highlighted the difficulty of identifying generalizable 2D + X architectures, and the most relevant strategy among the directions given above remains an open issue at the present time.
On the one hand, the limitation affecting 2D + X frameworks on large data volumes is the intricacy of nD convolution kernel updating strategies with respect to the capture of tiny objects/events in huge data when n is large. For such huge data, and due to the above computational limitation, robust network design is challenging and training is, at the present time, subject to assistance: for instance, in Yoon et al. [16], only 2D spatial directional convolution operations are used for a first training stage, and some of the resulting weights are selected to guide the 3D MatConvNet on a patch-by-patch basis. The same holds true for the two-stream spatial and temporal strategy given in Carreira and Zisserman [8] and Feichtenhofer et al. [7]: the approaches are deliberately separable (handcrafted extraction of spatial and optical flow features, whereas a single 'intelligent' 3D network should have been able to perform this extraction if exploration of the intrinsic 3D feature space had been straightforward). Another solution to limit computational complexity is the use of compressed-domain video representations, as in Chadha et al. [17]. However, the results obtained with this approach are slightly less relevant than those obtained by the two-stream fusion stages used in Carreira and Zisserman [8] for the recognition of homogeneous actions on the same databases. Thus, compression can limit performance depending on its rate.
On the other hand, the major hardware issue when handling huge 2D + X datasets is the limited random-access memory available on standard computer architectures. This leads to limited training capabilities at the present time, since convergence to a desirable solution cannot be guaranteed when using tiny loads in the optimization batches.
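For a sense of scale (illustrative numbers only, not the paper's experimental settings), a single float32 training batch of raw 2D + t clips already consumes on the order of a gigabyte before any activation or gradient storage:

```python
# Back-of-the-envelope RAM load for one training batch of raw video tensors.
batch, frames, height, width, channels = 32, 64, 224, 224, 3  # hypothetical settings
bytes_per_value = 4  # float32
batch_bytes = batch * frames * height * width * channels * bytes_per_value
print(f"{batch_bytes / 2**30:.2f} GiB per batch")  # ~1.15 GiB, before any activations
```

Activations of intermediate 3D convolution layers multiply this footprint further, which is why batch sizes shrink as n grows.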
It is worth mentioning that [Option-2] can be achieved by compacting the spatial dimensions into a 1D format, thus converting 2D + t video data into a 2D meta-image for instance. But not all 2D-to-1D transforms guarantee nice properties for capturing the dependencies that are intrinsic to spatial image features. In order to perform [Option-2] while compacting image spatial dependencies at best, the paper proposes to consider the Hilbert space-filling 1D image description.
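As an illustrative sketch (NumPy; the helper names below are hypothetical, not the paper's implementation), a 2D + t clip can be mapped to a 2D meta-image by flattening each frame along the Hilbert curve and stacking the resulting 1D signals, one frame per row:

```python
import numpy as np

def d2xy(n, d):
    """Map index d along the Hilbert curve to (x, y) on an n x n grid (n a power of 2)."""
    x = y = 0
    t, s = d, 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y, t, s = x + s * rx, y + s * ry, t // 4, s * 2
    return x, y

def video_to_meta_image(frames):
    """frames: (T, n, n) array -> (T, n*n) meta-image, columns in Hilbert order."""
    t, n, _ = frames.shape
    xs, ys = zip(*(d2xy(n, d) for d in range(n * n)))
    return frames[:, np.array(xs), np.array(ys)]

# toy clip: 8 frames of 16 x 16 pixels -> an 8 x 256 meta-image
clip = np.random.default_rng(0).random((8, 16, 16)).astype(np.float32)
meta = video_to_meta_image(clip)
print(meta.shape)  # (8, 256)
```

The third variable X of the initial file format thus becomes the second axis of the meta-image, so off-the-shelf 2D image frameworks apply directly.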
The first set of contributions concerns the analysis of Hilbert space-filling curves with respect to a concise and compressible spatial feature representation under convolution operators. This analysis is performed in terms of: (a) the maximal spatial shifts loaded with regard to the length of the convolution filter and (b) the sparsity degrees of convolution operators when the convolution is performed in the 1D Hilbert domain; see Sections 2 and 3 respectively.
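Point (a) can be probed empirically. The sketch below (hypothetical helper names, not the paper's code) measures the largest 2D Euclidean displacement spanned by any length-k window of the Hilbert ordering, i.e. the maximal spatial shift a 1D convolution filter of length k actually covers:

```python
import numpy as np

def d2xy(n, d):  # standard Hilbert index-to-coordinate conversion (n a power of 2)
    x = y = 0
    t, s = d, 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y, t, s = x + s * rx, y + s * ry, t // 4, s * 2
    return x, y

def max_spatial_shift(n, k):
    """Largest pairwise Euclidean distance inside any length-k Hilbert window."""
    pts = np.array([d2xy(n, d) for d in range(n * n)], dtype=float)
    worst = 0.0
    for start in range(n * n - k + 1):
        w = pts[start:start + k]
        diff = w[:, None, :] - w[None, :, :]  # pairwise displacements in the window
        worst = max(worst, float(np.sqrt((diff ** 2).sum(-1)).max()))
    return worst

print(max_spatial_shift(4, 2))   # consecutive Hilbert cells are always 4-adjacent: 1.0
print(max_spatial_shift(16, 9))  # a length-9 filter stays within a small spatial patch
```

The locality of the Hilbert order is what makes short 1D filters meaningful on the meta-image: a 1D window maps to a compact 2D neighborhood rather than to scattered pixels.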
The second set of contributions, provided in Section 4 as a valuable application of the former contribution, concerns a solution to the challenging issue of heterogeneous action recognition in 2D + t data. In contrast with the homogeneous action recognition issue, where any category is composed of approximately the same types of motion (for example 'running', 'smiling', etc., handled among others in Carreira and Zisserman [8] thanks to homogeneous motion databases), the heterogeneous case of violence interpretation (several types of actions having the same consequence: a feeling of violence) is very intricate and somewhat subjective. We will present a state of the art on violence detection in Section 4 and address violence action recognition on the basis of: (i) violence data benchmarking and (ii) a 2D CNN performing heterogeneous action learning from Hilbert-based timed meta-image datasets.
To summarize, the major contributions provided through [Option-2] aim at:
- •
exploiting image spatial compressibility in order to reduce memory load issues,
- •
learning both spatial and temporal features jointly,
- •
deriving a framework that makes the use of well-known 2D image based frameworks straightforward.
The main processing steps associated with the paper and developed below are described in the block diagram given by Fig. 1.
Section snippets
Hilbert space-filling curves: spatial data loads with respect to convolution size
Throughout, the grid always refers to a square grid indexing 2^M × 2^M pixels and associated with an upper-left (0,0) corner. This indexing is for convenience: Hilbert space-filling curves can also be computed on non-square grids. For the adaptation of Hilbert space-filling constructions to the image domain, see [18] (resampling operators) and [19] (lossless compression), among other references.
Non-stationarity of Hilbert domain convolution outputs
In this section, we are interested in the statistical properties of the convolution output J* given by Eq. (3). One can first note that if the input image I has a constant mean, then the same holds true for J*, since for any ℓ the expectation of J*[ℓ] reduces to the mean of I scaled by the sum of the filter coefficients. Thus, we can assume without loss of generality that I is zero-mean in the following.
The autocorrelation function of J* follows from Eq. (3); if I is stationary, the corresponding Hilbert-domain output J* is, in general, non-stationary.
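The effect can be illustrated numerically (a minimal sketch, not the paper's Eq. (3) notation): take a spatially stationary moving-average field on the torus; its covariance between Hilbert positions ℓ and ℓ + τ depends on ℓ, because the spatial displacement associated with a fixed 1D lag varies along the curve, so the 1D signal read along the curve, and hence any convolution of it, is non-stationary:

```python
import numpy as np

def d2xy(n, d):  # standard Hilbert index-to-coordinate conversion (n a power of 2)
    x = y = 0
    t, s = d, 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y, t, s = x + s * rx, y + s * ry, t // 4, s * 2
    return x, y

n, tau = 16, 4
pts = [d2xy(n, d) for d in range(n * n)]

def cov(p, q):
    """Exact covariance of a 3x3 moving average of unit white noise on an n x n torus."""
    dx = min(abs(p[0] - q[0]), n - abs(p[0] - q[0]))
    dy = min(abs(p[1] - q[1]), n - abs(p[1] - q[1]))
    return max(0, 3 - dx) * max(0, 3 - dy) / 81.0

c = np.array([cov(pts[l], pts[l + tau]) for l in range(n * n - tau)])
# the lag-tau covariance changes with the position l: the 1D signal is non-stationary
print(c.min(), c.max())
```

Here the field is stationary in 2D by construction, yet the lag-τ covariance along the curve ranges from zero to a strictly positive value depending on ℓ, which is the non-stationarity this section quantifies.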
Homogeneous versus heterogeneous action recognition
In terms of action categorization, homogeneity means that the category elements share the same spatio-temporal flow property: for instance, the action consisting of applying eye makeup has coarsely the same movements in UCF-101: from the bottle to the eyelashes and eyebrows. There is no other way to apply this mascara (at least in these examples) and this makes the action more predictable: eyelashes / eyebrows, a bottle of mascara and an arm movement are enough to determine the action with great
Conclusion and prospects
This work has addressed intrinsic 3D feature learning from a Hilbert-based meta-image description of 3D data. The 3D = 2D + X description has been designed on the basis of the duality between spatial 2D observations and the additional dimension X, which can represent time, wavelength or depth information. The aim of this description was to obtain a good balance between spatial information (compacted in one dimension) and the additional information provided by variations in X.
More specifically, we have
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The work was supported by grant SAINS of the Université Savoie Mont Blanc, France. Numerical simulations have been performed thanks to the facilities offered by MUST computing center of Université Savoie Mont Blanc and French National Agency of Research PHOENIX ANR-15-CE23-0012 workstations. The violence datasets post-processed and analyzed were supplied by MEDIAEVAL and TECHNICOLOR initiatives. The authors are grateful to the USMB/LISTIC members who have participated in video clip annotations.
References (40)
- Spatio-temporal deformable 3D ConvNets with attention for action recognition, Pattern Recognit. (2020)
- Two-stream convolutional networks for action recognition in videos, Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 1 (2014)
- Learning motion representation for real-time spatio-temporal action localization, Pattern Recognit. (2020)
- Learning principal orientations and residual descriptor for action recognition, Pattern Recognit. (2019)
- Order-aware convolutional pooling for video based action recognition, Pattern Recognit. (2019)
- Action recognition in video sequences using deep bi-directional LSTM with CNN features, IEEE Access (2018)
- D. Hilbert, Über die stetige Abbildung einer Linie auf ein Flächenstück, Springer Berlin Heidelberg, Berlin
- Crowd violence detection using global motion-compensated Lagrangian features and scale-sensitive video-level representation, IEEE Trans. Inf. Forensics Secur. (2017)
- MatConvNet: convolutional neural networks for MATLAB, Proceedings of the 23rd ACM International Conference on Multimedia (2015)
- Mexconv3d: MATLAB MEX implementation of the basic operations for 3D (volume) convolutional neural networks, framework available online (2016)
- TSM: temporal shift module for efficient video understanding, IEEE International Conference on Computer Vision (ICCV)
- Asymmetric 3D convolutional neural networks for action recognition, Pattern Recognit.
- Convolutional two-stream network fusion for video action recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- Quo vadis, action recognition? A new model and the Kinetics dataset, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
- Domain learning joint with semantic adaptation for human action recognition, Pattern Recognit.
- Learning pose grammar to encode human body configuration for 3D pose estimation, AAAI Conference on Artificial Intelligence
- Learning human-object interactions by graph parsing neural networks, European Conference on Computer Vision (ECCV)
- Feasibility of 3D reconstruction of neural morphology using expansion microscopy and barcode-guided agglomeration, Front. Comput. Neurosci.
- Compressed-domain video classification with deep neural networks: "there's way too much information to decode the matrix", IEEE International Conference on Image Processing (ICIP) (2017)
- On the use of space-filling curves in changing image dimensionality, Inf. Technol. Control
Abdourrahmane Mahamane Atto received: the Ph.D. degree in mathematics and application, co-delivered by the University of Rennes I and TELECOM Bretagne, France (2008) and the Habilitation degree for research supervision from the University Grenoble Alpes, France (2015). Since September 2011, he has been an associate professor at the University Savoie Mont Blanc, Polytech Annecy-Chambéry, LISTIC, France. His research interests concern mathematical methods and models for image time series and information processing.
Alexandre Benoit received the Ph.D. degree in electronics and computer science from the University of Grenoble, INP, in 2007. Since 2008, he has been an associate professor at Université Savoie Mont Blanc, LISTIC lab. His main research interest is the understanding of still images and temporal image sequences. He develops deep learning models for remote sensing and multimedia applications such as image classification, pixel-level semantic segmentation and regression problems for astrophysics. He develops specific approaches adapted to the sensor and data, such as a retina model now distributed in the OpenCV library, and specific deep neural networks and operators for hyperspectral image analysis and astrophysics.
Patrick Lambert received the Ph.D. degree in signal processing from the National Polytechnic Institute of Grenoble, Grenoble, France, in 1983. He is currently a Full Professor with the School of Engineering, Université Savoie Mont Blanc, Annecy, France, and a member of the Informatics, Systems, Information and Knowledge Processing Laboratory (LISTIC), Annecy, France. His research interests include image and video analysis, currently dedicated to non-linear color filtering and automatic image understanding.