Pattern Recognition

Volume 104, August 2020, 107353

Timed-image based deep learning for action recognition in video sequences

https://doi.org/10.1016/j.patcog.2020.107353

Highlights

  • Image data conditioning issue: the paper first highlights that mapping 2D spatial convolution to its 1D Hilbert based instance is highly accurate in terms of information compressibility upon the image frames associated with a wide class of video files.

  • Video library conditioning issue: because of the above compressibility, the paper proposes converting the 2D + X data volume into a single meta-image file format, called the timed-image, prior to applying machine learning frameworks. This conversion is such that any 2D frame of the 2D + X data is reshaped as a 1D array indexed by a Hilbert space-filling curve, while the third variable X of the initial file format becomes the second variable in the meta-image format.

  • Sensitive action recognition benchmark: the paper provides two datasets having 2 and 3 violence video categories respectively. The datasets involve visually non-violent, moderately violent and extremely violent actions.

  • Sensitive action recognition issue: outstanding 2-level and 3-level violence classification results are obtained from deep convolutional neural networks trained from scratch and operating on meta-image databases.

Abstract

The paper addresses two issues relative to machine learning on 2D + X data volumes, where 2D refers to image observation and X denotes a variable that can be associated with time, depth, wavelength, etc. The first issue addressed is the conditioning of these structured volumes for compatibility with convolutional neural networks operating on 2D image file formats. The second issue is associated with sensitive action detection in the “2D + Time” case (video clips and image time series). For the data conditioning issue, the paper first highlights that mapping 2D spatial convolution to its 1D Hilbert based instance is highly accurate in terms of information compressibility upon tight frames of convolutional networks. As a consequence of this compressibility, the paper proposes converting the 2D + X data volume into a single meta-image file format prior to applying machine learning frameworks. This conversion is such that any 2D frame of the 2D + X data is reshaped as a 1D array indexed by a Hilbert space-filling curve, while the third variable X of the initial file format becomes the second variable in the meta-image format. For the sensitive action recognition issue, the paper provides: (i) a 3-category video database involving non-violent, moderate and extreme violence actions; (ii) the conversion of this database into a timed meta-image database through the 2D + Time to 2D conditioning stage described above and (iii) outstanding 2-level and 3-level violence classification results from deep convolutional neural networks operating on meta-image databases.

Introduction

Convolutional Neural Networks (CNN) have demonstrated outstanding performance in recent image processing engines. Many frameworks, specifically designed and optimized for the image file format (see [1, MatConvNet] for instance), have helped fill the semantic gap between raw images and the high-level objects that can be recognized from their contents.

When considering a video 2D + t or a stereoscopic 2D + d file format, CNN based feature extraction requires:

  • [Option-1] either adapting the network configurations according to dimension extents (by considering dimension extension for network parameters, by separating spatial feature analysis and temporal/depth information processing, etc.);

  • [Option-2] or relating the 2D + X data to a 2D meta-image format in order to directly use the above frameworks (already optimized for image analysis).

The literature has mainly addressed [Option-1], along a wide range of observable directions. The first direction involves the computational technicality standpoint relative to 2D + X feature extraction architectures: for instance, CPU and GPU based architectures have been proposed recently in Sun [2] for 3D convolutions and max-pooling operations that are consistent with MatConvNet. In addition, Lin et al. [3] have proposed alternative shift and merge modules for spatio-temporal information aggregation, in order to reduce the computational complexity of 3D convolution; Yang et al. [4] have proposed asymmetric one-directional 3D convolutions, whereas [5] has preferred deformable 3D convolutions.

The second direction concerns two-level/stream architectures operating respectively on spatial and temporal (optical flow) features for learning actions in 2D + t video datasets: for instance, Simonyan and Zisserman [6] have proposed two independent CNNs for learning both still frames and motion between frames, and [7] has considered a refinement of [6] in terms of spatial and temporal information integration at several levels of granularity. Another solution, proposed in Carreira and Zisserman [8], is also a two-stream 2D + t architecture where images and their optical flows are processed separately by using convenient convolution operators prior to late fusion. Still in the same direction, Zhang and Hu [9] have proposed an adaptation of the two-stream framework for long-range video representation by using multiple local features, whereas [10] has proposed an extension of the two-stream framework from a multiscale perspective.

Some alternative directions can be found in:

  • Chen et al. [11], in terms of learning a projection matrix associated with principal orientations, or Wang et al. [12], from a series of one-dimensional temporal convolution operations;

  • Ullah et al. [13], which adopts a bidirectional Long Short-Term Memory (LSTM) framework for a recurrent feature description strategy, with the constraint of selecting specific video frames since, otherwise, dimensionality leads to non-tractable algorithms on limited computational resources;

  • Fang et al. [14], where graph parsing neural network architectures are developed, and [15], where ontology-like grammars can be used to disambiguate certain specific situations.

In terms of action recognition benchmarks, most of the above references have highlighted the difficulty of identifying generalizable 2D + X architectures, and the most relevant strategy among the directions given above remains an open issue at present.

On the one hand, the limitation affecting 2D + X frameworks on large data volumes is the intricacy of nD convolution kernel updating strategies with respect to the capture of tiny objects/events in huge data when n is large. For such huge data, and due to the above computational limitation, robust network design is challenging and training currently requires assistance: for instance, in Yoon et al. [16], only 2D spatial directional convolution operations are used for a first training stage, and some of the weights obtained are selected to guide the 3D MatConvNet on a patch-by-patch basis. The same holds true for the two-stream spatial and temporal strategy given in Carreira and Zisserman [8] and Feichtenhofer et al. [7]: the approaches are deliberately separable (handcrafted extraction of spatial and optical flow features, whereas a single ‘intelligent’ 3D network should have been able to perform this extraction if exploration of the intrinsic 3D feature space had been straightforward). Another solution to limit computational complexity is the use of compressed domain video representations, as in Chadha et al. [17]. However, the results obtained with this approach are slightly weaker than those obtained by the two-stream fusion stages used in Carreira and Zisserman [8] for recognition of homogeneous actions on the same databases. Thus, compression can limit performance depending on its rate.

On the other hand, the major hardware issue when handling huge 2D + X datasets is the limited random-access memory available on standard computer architectures. This currently limits training capabilities, since convergence to a desirable solution cannot be guaranteed when using tiny loads in the optimization batches.

It is worth mentioning that [Option-2] can be achieved by compacting the spatial dimensions into a 1D format, thus converting 2D + t video data into a 2D meta-image for instance. But not all 2D-to-1D transforms guarantee nice properties for capturing the dependencies that are intrinsic to spatial image features. In order to perform [Option-2] while best compacting image spatial dependencies, the paper proposes the Hilbert space-filling 1D image description.
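As an illustration of this conditioning step, the following Python sketch converts a 2D + t clip into a timed meta-image: each frame is unrolled along a Hilbert curve and the frame index becomes the second axis. The index mapping uses the classic iterative bit-manipulation construction of the curve, which is one standard way of computing it and is not claimed to be the paper's own implementation; array names and sizes are illustrative assumptions.

```python
import numpy as np

def xy2d(n, x, y):
    """Hilbert index of pixel (x, y) on an n x n grid, n a power of 2
    (classic iterative construction, one of several equivalent ones)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate the sub-quadrant so the curve stays continuous
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

def to_timed_image(clip):
    """clip: (T, n, n) array of frames -> (n * n, T) timed meta-image,
    with rows indexed by Hilbert position and columns indexed by time."""
    T, n, _ = clip.shape
    # hilbert[y * n + x] = curve position of the raster-scan pixel (x, y)
    hilbert = np.array([xy2d(n, x, y) for y in range(n) for x in range(n)])
    order = np.argsort(hilbert)      # raster indices sorted by curve position
    flat = clip.reshape(T, n * n)    # raster-scan each frame
    return flat[:, order].T          # (n * n, T)

clip = np.random.rand(64, 32, 32)    # 64 frames of 32 x 32 pixels
meta = to_timed_image(clip)
print(meta.shape)                    # (1024, 64)
```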

The first set of contributions concerns the analysis of Hilbert space-filling curves with respect to a concise and compressible spatial feature representation under convolution operators. This analysis is performed in terms of: (a) the maximal spatial shifts loaded in regard to the length of the convolution filter and (b) the sparsity degrees of convolution operators when the convolution is performed in the 1D Hilbert domain; see Sections 2 and 3 respectively.

The second set of contributions, provided in Section 4 as a valuable application of the former contribution, concerns a solution to the challenging issue of heterogeneous action recognition in 2D + t data. In contrast with the homogeneous action recognition issue, where any category is composed of approximately the same types of motion (for example ‘running’, ‘smiling’, etc., handled among others in Carreira and Zisserman [8] thanks to homogeneous motion databases), the heterogeneous case of violence interpretation (several types of actions having the same consequence: a feeling of violence) is very intricate and somewhat subjective. We will present a state of the art on violence detection in Section 4 and address violence action recognition on the basis of: (i) violence data benchmarking and (ii) 2D CNNs performing heterogeneous action learning on Hilbert based timed meta-image datasets.

To summarize, the major contributions provided through [Option-2] aim at:

  • exploiting image spatial compressibility in order to reduce memory load issues,

  • learning both spatial and temporal features jointly,

  • deriving a framework that makes the use of well-known 2D image based frameworks straightforward.

The main processing steps associated with the paper and developed below are described in the block diagram given by Fig. 1.
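To make the last contribution concrete, here is a minimal sketch of a standard 2D CNN consuming timed meta-images for 3-level violence classification. This is an assumption-level illustration in PyTorch, not the architecture trained in the paper; layer widths and input sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Minimal 2D CNN over (1, n*n, T) timed meta-images; illustrative only.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # global pooling accepts any (n*n, T) extent
    nn.Flatten(),
    nn.Linear(32, 3),          # 3 classes: non-violent / moderate / extreme
)

x = torch.randn(4, 1, 1024, 64)  # 4 meta-images: 64 frames of 32 x 32 pixels
print(model(x).shape)            # torch.Size([4, 3])
```

The payoff of the conversion is visible here: a stock 2D pipeline handles the temporal dimension with no architectural change.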

Section snippets

Hilbert space-filling curves: spatial data loads with respect to convolution size

Throughout, G(M) always refers to a square grid indexing 2^M × 2^M pixels, with its upper-left corner at (0,0). This indexing is for convenience: Hilbert space-filling curves can also be computed on non-square grids. For the adaptation of Hilbert space-filling constructions to the image domain, see [18] (resampling operators) and [19] (lossless compression), among other references.
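As a complement (again an assumption-level sketch rather than the paper's code), the inverse mapping from a curve position d to a pixel (x, y) on G(M) can be computed iteratively, and the defining locality property of the curve, namely that consecutive Hilbert indices always address 4-neighbour pixels, can be checked directly:

```python
def d2xy(n, d):
    """Pixel (x, y) at Hilbert index d on an n x n grid, n a power of 2."""
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

# Locality check on G(M) with M = 5: successive curve points are 4-neighbours.
n = 2 ** 5
pts = [d2xy(n, d) for d in range(n * n)]
assert all(abs(x1 - x0) + abs(y1 - y0) == 1
           for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```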

Non-stationarity of Hilbert domain convolution outputs

In this section, we are interested in the statistical properties of the convolution output $J^*$ given by Eq. (3). One can first note that if the input image $I$ has constant mean, then the same holds true for $J^*$ since, for any $\ell$,

$$\mathbb{E}[J^*(\ell)] = \sum_k h^*(k)\,\mathbb{E}\big[I(U(\ell-k), V(\ell-k))\big] = \mathbb{E}[I] \times \sum_k h^*(k).$$

Thus, we can assume without loss of generality that $I$ is zero-mean in the following.
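This mean identity can be sanity-checked numerically; the sketch below assumes an i.i.d. constant-mean input along an arbitrary fixed scan (the identity does not depend on which scan is used) and an arbitrary filter:

```python
import numpy as np

rng = np.random.default_rng(0)
I_scan = rng.normal(loc=5.0, scale=1.0, size=4096)  # scanned frame, constant mean 5
h = rng.normal(size=7)                              # arbitrary filter h*

J = np.convolve(I_scan, h, mode="valid")  # J*(l) = sum_k h*(k) I_scan(l - k)
print(J.mean())                           # both close to E[I] * sum_k h*(k)
print(I_scan.mean() * h.sum())
```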

The autocorrelation function of $J^*$ is

$$\mathbb{E}[J^*(\ell)\,J^*(\ell+\tau)] = \sum_{k,\,k'} h^*(k)\,h^*(k')\,\mathbb{E}\big[I(U(\ell-k), V(\ell-k)) \times I(U(\ell+\tau-k'), V(\ell+\tau-k'))\big]$$

If $I$ is stationary, then

Homogeneous versus heterogeneous action recognition

In terms of action categorization, homogeneity means that the category elements share the same spatio-temporal flow property: for instance, the action of applying eye makeup involves roughly the same movements in UCF-101: from the bottle to the eyelashes and eyebrows. There is no other way to apply this mascara (at least in these examples) and this makes the action more predictable: eyelashes/eyebrows, a bottle of mascara and an arm movement are enough to determine the action with great

Conclusion and prospects

This work has addressed intrinsic 3D feature learning from a Hilbert based meta-image description of 3D data. The 3D = 2D + X description has been designed on the basis of the duality between spatial 2D observations and the additional dimension X, which can relate to time, wavelength or depth information. The aim of this description was to obtain a good balance between spatial information (compacted in one dimension) and the additional information provided by variations in X.

More specifically, we have

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The work was supported by grant SAINS of the Université Savoie Mont Blanc, France. Numerical simulations were performed thanks to the facilities offered by the MUST computing center of Université Savoie Mont Blanc and the workstations of the French National Research Agency project PHOENIX ANR-15-CE23-0012. The violence datasets post-processed and analyzed were supplied by the MEDIAEVAL and TECHNICOLOR initiatives. The authors are grateful to the USMB/LISTIC members who participated in the video clip annotations.


References (40)

  • J. Lin et al., TSM: temporal shift module for efficient video understanding, IEEE International Conference on Computer Vision (ICCV), 2019.
  • H. Yang et al., Asymmetric 3D convolutional neural networks for action recognition, Pattern Recognit., 2019.
  • C. Feichtenhofer et al., Convolutional two-stream network fusion for video action recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • J. Carreira et al., Quo vadis, action recognition? A new model and the kinetics dataset, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • J. Zhang et al., Domain learning joint with semantic adaptation for human action recognition, Pattern Recognit., 2019.
  • H.-S. Fang et al., Learning pose grammar to encode human body configuration for 3D pose estimation, AAAI Conference on Artificial Intelligence, 2018.
  • S. Qi et al., Learning human-object interactions by graph parsing neural networks, European Conference on Computer Vision (ECCV), 2018.
  • Y.-G. Yoon et al., Feasibility of 3D reconstruction of neural morphology using expansion microscopy and barcode-guided agglomeration, Front. Comput. Neurosci., 2017.
  • A. Chadha et al., Compressed-domain video classification with deep neural networks: “there’s way too much information to decode the matrix”, IEEE International Conference on Image Processing (ICIP), 2017.
  • J. Valantinas, On the use of space-filling curves in changing image dimensionality, Inf. Technol. Control, 2005.

Abdourrahmane Mahamane Atto received the Ph.D. degree in mathematics and applications, co-delivered by the University of Rennes I and TELECOM Bretagne, France (2008), and the Habilitation degree for research supervision from the University Grenoble Alpes, France (2015). Since September 2011, he has been an associate professor at the University Savoie Mont Blanc, Polytech Annecy-Chambéry, LISTIC, France. His research interests concern mathematical methods and models for image time series and information processing.

Alexandre Benoit received the Ph.D. degree in electronics and computer science from the University of Grenoble, INP, in 2007. Since 2008, he has been an associate professor at Université Savoie Mont Blanc, at the LISTIC lab. His main research interest is the understanding of still images and temporal image sequences. He develops deep learning models for remote sensing and multimedia applications such as image classification, pixel-level semantic segmentation and regression problems for astrophysics. He develops specific approaches adapted to the sensor and data, such as a retina model now distributed in the OpenCV library, and specific deep neural networks and operators for hyperspectral image analysis and astrophysics.

Patrick Lambert received the Ph.D. degree in signal processing from the National Polytechnic Institute of Grenoble, Grenoble, France, in 1983. He is currently a Full Professor with the School of Engineering, Université Savoie Mont Blanc, Annecy, France, and a member of the Informatics, Systems, Information and Knowledge Processing Laboratory (LISTIC), Annecy, France. His research interests include image and video analysis, currently dedicated to non-linear color filtering and automatic image understanding.
