Abstract

Existing chaos-based methods for action recognition in videos rely on hand-crafted features, which limits their recognition accuracy. In this paper, we extend ChaosNet to a deep neural network and apply it to action recognition. First, we build a deep ChaosNet to extract action features. Then, we feed the features to a low-level LSTM encoder and a high-level LSTM encoder to obtain low-level and high-level coding results, respectively. An agent acts as the behavior recognizer and produces the recognition results; its manager is a hidden layer responsible for issuing behavioral segmentation targets at the high level. Experiments on two standard action datasets, UCF101 and HMDB51, show that the proposed algorithm outperforms the state of the art.

1. Introduction

Human action recognition in videos is an important area of computer vision, receiving sustained attention from researchers due to potential applications such as video surveillance, entertainment, user interfaces, sports, video understanding, and patient monitoring. Current action recognition methods can be classified into three categories according to the action feature they use: chaos-based features [1], manual features [2], and deep learned features. Inspired by the chaos-based and deep learned features, we propose deep ChaosNet for action recognition to autonomously learn the nonlinear dynamical features of actions in video.

2. Related Work

In this section, we briefly review the action recognition literature on chaos-based features, manual features, and deep learned features.

2.1. Chaos-Based Feature

Ali et al. [3] introduced a human action recognition framework that uses the theory of chaotic systems to model and analyze the nonlinear dynamics of human actions; trajectories of reference joints serve as the representation of the nonlinear dynamical system generating the action. Xia et al. [4] proposed a human behavior recognition method based on chaotic invariant features and the relevance vector machine (RVM): joint-point trajectories are extracted to represent the nonlinear system underlying the behavior, the time delay is estimated by the C-C method, chaotic invariants describing the behavior are computed, and the RVM is used to identify the behavior. Venkataraman and Turaga [1] proposed a descriptor of the shape of the dynamical attractor as a feature representation of the nature of the dynamics, addressing drawbacks of traditional approaches.
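As background for these trajectory-based approaches, the phase space of a joint trajectory is usually reconstructed by time-delay embedding before chaotic invariants are computed. The short Python sketch below illustrates only this generic step; the function name and default values are illustrative and not taken from [3, 4], and the delay tau would in practice be estimated, for example, by the C-C method mentioned above.

import numpy as np

def time_delay_embed(x, dim=3, tau=5):
    # Reconstruct the phase space of a 1-D joint trajectory x by time-delay embedding.
    # dim (embedding dimension) and tau (time delay) are placeholders; tau is typically
    # estimated with the C-C method before chaotic invariants are computed.
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)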

2.2. Manual Feature

Since human behavior is composed of body movements, general behavior descriptors are built on low-level visual motion characteristics. These low-level visual features are easy to extract and represent, and the motion features of the same action remain fairly robust across different cameras, so they were widely used in early human behavior recognition. Such features fall into two categories: global representations and local representations. Global descriptors capture spatiotemporal cues from single-frame appearance and video frame sequences, drawing on human body contours, posture joint points, and saliency segmentation; examples include the motion history image (MHI) proposed by Bobick and Davis [5], the adaptation of the shape context proposed by Zhang et al. [6], and the kinematic features proposed by Ali and Shah [7]. Local descriptors of low-level action features remain a hotspot in human behavior recognition research; researchers model the changes of the motion field between frames and have proposed various local spatiotemporal feature descriptors, such as STIP [8], MoSIFT [9, 10], and dense trajectories [2, 11].

2.3. Deep Learned Feature

Deep learned features cover two aspects: action convolution features and action timing features. The former uses convolutional neural networks (CNNs) to learn local deep features of human behavior from different modalities, such as RGB frames and the optical flow of behavior videos [12]. On top of these convolutional features, methods such as recurrent neural networks (RNNs), temporal segment networks, or linear coding are used to learn the temporal features of the successive stages of a behavior [13]. Because GPU/CPU memory is limited and behaviors have different durations (i.e., different numbers of video frames), it is difficult to feed all frames of a behavior video into a deep learning framework for feature learning; key frame sampling is therefore required in the recognition process. Most existing behavior recognition algorithms use uniform sampling [13] or sequential sampling [14-16], ignoring the differences in the development process of human behavior, so the selected key frames are less representative.

3. Deep ChaosNet Framework

Inspired by Wang et al. and Balakrishnan et al. [15, 17], we propose the deep ChaosNet framework for action recognition, illustrated in Figure 1. Deep ChaosNet features are extracted from the video frames and then sent to a low-level LSTM encoder and a high-level LSTM encoder to obtain low-level coding outputs and high-level coding results, respectively. The agent is a behavior recognizer that produces the recognition results. Based on hierarchical reinforcement learning, the agent is composed of a manager and a worker. The manager is a hidden layer responsible for issuing behavioral segmentation targets at the high level. The worker determines the spatiotemporal area of the video subsegment that best characterizes the given segmentation target and outputs the segment-level recognition result.
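For concreteness, the sketch below outlines the two-level encoding pipeline in PyTorch, using the feature and hidden sizes reported in Section 4; feeding the low-level output directly into the high-level encoder, as well as the module names, is an assumption made for illustration rather than the exact implementation.

import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    # Frame-level deep ChaosNet features -> low-level Bi-LSTM -> high-level LSTM.
    def __init__(self, feat_dim=2048, proj_dim=512, low_hidden=512, high_hidden=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, proj_dim)          # project 2048-dim frame features to 512-dim
        self.low_encoder = nn.LSTM(proj_dim, low_hidden, batch_first=True, bidirectional=True)
        self.high_encoder = nn.LSTM(2 * low_hidden, high_hidden, batch_first=True)

    def forward(self, frame_feats):                        # frame_feats: (batch, T, 2048)
        x = self.proj(frame_feats)
        low_out, _ = self.low_encoder(x)                   # low-level coding output
        high_out, _ = self.high_encoder(low_out)           # high-level coding result
        return low_out, high_out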

3.1. Structure of the Network

The network system structure is shown in Figure 2. The manager LSTM unit obtains environmental state information $s_t$ from its input and derives a meaningful behavioral stage goal $g_t$, which is used as the worker LSTM input to guide the worker to select the spatiotemporal region of the next behavioral video subsegment; the formula is as follows:

$$g_t = f_m(s_t),$$

where $f_m$ is the manager LSTM nonlinear function, responsible for mapping the environmental state information $s_t$ to the behavioral stage target $g_t$. The worker LSTM unit obtains context information $c_t$ according to the input $g_t$. Based on $c_t$, we predict the next key frame position $l_{t+1}$, sampling area $r_{t+1}$, and behavior category $y_t$.
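The sketch below illustrates how the manager goal $g_t$ could drive the worker's three predictions; the hidden sizes follow Section 4 where stated, while the goal dimension, the prediction heads, and the use of the goal as the worker's only input are assumptions made for illustration.

import torch
import torch.nn as nn

class ManagerWorker(nn.Module):
    # Illustrative manager/worker agent, not the exact implementation.
    def __init__(self, state_dim=256, goal_dim=64, worker_hidden=1024, num_classes=101):
        super().__init__()
        self.manager = nn.LSTMCell(state_dim, 256)                # manager LSTM (hidden size 256)
        self.to_goal = nn.Linear(256, goal_dim)                   # linear layer into the latent goal space
        self.worker = nn.LSTMCell(goal_dim, worker_hidden)        # worker LSTM (hidden size 1024)
        self.frame_head = nn.Linear(worker_hidden, 1)             # next key frame position l_{t+1}
        self.area_head = nn.Linear(worker_hidden, 4)              # sampling area r_{t+1} (box coordinates assumed)
        self.class_head = nn.Linear(worker_hidden, num_classes)   # behavior category y_t

    def step(self, s_t, m_state=None, w_state=None):
        h_m, c_m = self.manager(s_t, m_state)                     # g_t = f_m(s_t)
        g_t = self.to_goal(h_m)
        h_w, c_w = self.worker(g_t, w_state)                      # worker context c_t driven by the goal
        preds = (self.frame_head(h_w), self.area_head(h_w), self.class_head(h_w))
        return preds, (h_m, c_m), (h_w, c_w)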

For both manager and worker, this work uses a visual attention mechanism to explore areas of salient behavior. The manager attention model mainly explores the significant segment-level information of the behavior, while the worker attention model assists in searching for the behavior key frames and the salient areas within a frame. The manager and worker attention weights, denoted $\alpha_t$ and $\beta_t$, respectively, are produced by the corresponding attention modules; a generic sketch follows.
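The attention formulation is not spelled out in the text, so the following generic additive soft-attention module is shown only as a stand-in for the manager and worker attention models; the scoring function and all names are assumptions.

import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    # Generic additive soft attention over encoder outputs.
    def __init__(self, query_dim, feat_dim, attn_dim=256):
        super().__init__()
        self.w_q = nn.Linear(query_dim, attn_dim)
        self.w_f = nn.Linear(feat_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, query, feats):
        # query: (batch, query_dim) LSTM hidden state; feats: (batch, T, feat_dim) encoder outputs
        scores = self.v(torch.tanh(self.w_q(query).unsqueeze(1) + self.w_f(feats)))  # (batch, T, 1)
        weights = torch.softmax(scores, dim=1)     # attention weights (alpha_t for manager, beta_t for worker)
        context = (weights * feats).sum(dim=1)     # attended summary used by the corresponding LSTM
        return context, weights.squeeze(-1)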

3.2. Deep Learning Process

The worker strategy learning process is a standard reinforcement learning process. At each step, the worker gives a classification prediction result $y_t$, and the environment returns a reward $r_t$, so the goal of worker strategy learning is to minimize the negative value of the reward function. The loss function is

$$L_w = -\mathbb{E}_{\pi_w}\Big[\sum_t r_t\Big],$$

where $\pi_w$ denotes the worker policy.

The manager does not directly interact with the environment, so its strategy learning process cannot simply copy the worker's. Compared with the manager's time step $t$, the worker strategy is relatively stable, and this strategy directly affects the worker's behavior classification output at time $t$. Although the manager is a hidden layer, its strategic goal should therefore be to minimize the negative value of the current reward. The loss function is

$$L_m = -\mathbb{E}_{\pi_m}\big[r_t\big],$$

where $\pi_m$ denotes the manager policy.
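A minimal sketch of optimizing these negative-reward objectives with a REINFORCE-style policy gradient is given below; the reward-to-go estimator and the absence of a baseline are assumptions for illustration, not the authors' exact training procedure.

import torch

def policy_gradient_loss(log_probs, rewards):
    # REINFORCE-style surrogate loss: minimizing it maximizes the expected reward.
    # log_probs: (T,) log-probabilities of the actions actually taken by the policy
    # rewards:   (T,) rewards returned by the environment at each step
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), dim=0), [0])  # reward-to-go
    return -(log_probs * returns.detach()).sum()

# The worker loss is applied to its key frame, sampling area, and class predictions;
# the manager loss takes the same form over its goal-selection policy with the current reward.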

4. Experiments and Results

We verify the proposed deep ChaosNet on two standard action datasets: UCF101 [18] and HMDB51 [19]. UCF101 is an action recognition dataset of realistic action videos with 101 action categories collected from YouTube. The videos in each action category are divided into 25 groups, and each group contains 4 to 7 videos of an action. Videos from the same group may share common features, such as similar backgrounds and viewpoints. HMDB51 contains 51 action classes with a total of 6849 videos collected from YouTube, Google Video, and other sources. Each action class contains at least 101 videos, with a resolution of 320 × 240.

In the experiments, we construct a 7-layer deep ChaosNet for both action datasets. The outputs of the deep ChaosNet are 2048-dimensional frame features, which are then projected to 512 dimensions. We use a Bi-LSTM with hidden size 512 as the low-level encoder and an LSTM with hidden size 256 as the high-level encoder [20]. The worker network consists of a worker LSTM with hidden size 1024. The manager network is composed of a manager LSTM with hidden size 256, an attention module, and a linear layer that projects the output of the LSTM into the latent goal space. The internal critic of the environment is also an RNN, containing a GRU, a built-in word embedding, a linear layer, and a sigmoid function.
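To make this configuration concrete, the sketch below assembles the internal critic along the lines described above; the token vocabulary, dimensions, and layer names are assumptions for illustration rather than the exact implementation.

import torch
import torch.nn as nn

class InternalCritic(nn.Module):
    # Internal critic of the environment: embedding -> GRU -> linear -> sigmoid.
    def __init__(self, num_tokens=101, embed_dim=256, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(num_tokens, embed_dim)   # built-in embedding (vocabulary assumed)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, tokens):
        # tokens: (batch, T) integer ids; the critic outputs a score in [0, 1]
        h, _ = self.gru(self.embed(tokens))
        return torch.sigmoid(self.score(h[:, -1]))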

We compare deep ChaosNet with state-of-the-art deep learning methods [2, 12-16]; the results are listed in Table 1. As shown in the table, the proposed deep ChaosNet exceeds the manual-feature method [2] by 7.3% on UCF101 and by 5.8% on HMDB51, and exceeds the action convolution features [12] by 0.2% on UCF101 and by 2.6% on HMDB51. It also outperforms the action timing features [13-16] by at least 1.1% on UCF101 and 0.7% on HMDB51. Overall, deep ChaosNet surpasses all the compared methods and sets a new state of the art.

5. Conclusions

We extend ChaosNet to a deep neural network and apply it to action recognition. We deepen the hidden layers of ChaosNet and separately feed still frames and inter-frame motion into the deep network to extract spatial and temporal action features, which then serve as the input to the attention-based action recognition framework. We verify our method on two standard action datasets, UCF101 and HMDB51, and the experimental results indicate that the proposed algorithm is competitive with the state of the art.

Data Availability

The data used to support the findings of this study are available from UCF101 (https://www.crcv.ucf.edu/research/data-sets/ucf101/; K. Soomro, A. R. Zamir, and M. Shah, "UCF101: a dataset of 101 human action classes from videos in the wild," CRCV-TR-12-01, November 2012) and HMDB51 (https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/; H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "HMDB: a large video database for human motion recognition," ICCV, 2011).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Natural Science Foundation of Hubei Province (Grant no. 2019CFC850), the Outstanding Youth Science and Technology Innovation Team Project of Colleges and Universities in Hubei Province (Grant no. T201923), the National Natural Science Foundation of China (Grant no. 61761044), and the Cultivation Project of Jingchu University of Technology (Grant no. PY201904).