Improved two-stream model for human action recognition
EURASIP Journal on Image and Video Processing (IF 2.4) Pub Date: 2020-06-17, DOI: 10.1186/s13640-020-00501-x
Yuxuan Zhao, Ka Lok Man, Jeremy Smith, Kamran Siddique, Sheng-Uei Guan

This paper addresses the recognition of human actions in videos. Human action recognition can be seen as the automatic labeling of a video according to the actions occurring in it, and it has become one of the most challenging and attractive problems in pattern recognition and video classification. The problem is difficult to solve with traditional video processing methods because of challenges such as background noise, varying subject sizes across videos, and differences in action speed. Building on progress in deep learning, several directions have been developed for recognizing a human action in a video, such as long short-term memory (LSTM)-based models, the two-stream convolutional neural network (CNN) model, and the convolutional 3D model. In this paper, we focus on the two-stream structure. The traditional two-stream CNN network addresses the fact that CNNs alone do not perform satisfactorily on temporal features: by training a temporal stream that takes optical flow as input, a CNN gains the ability to extract temporal features. However, optical flow contains only limited temporal information because it records only the movements of pixels along the x-axis and the y-axis. Therefore, we design and implement a new two-stream model that uses an LSTM-based model in its spatial stream to extract both spatial and temporal features from RGB frames, in contrast to traditional approaches, which typically use the spatial stream to extract only spatial features. In addition, we implement a DenseNet in the temporal stream to improve recognition accuracy. The quantitative evaluation and experiments are conducted on the UCF-101 dataset, a well-established public video dataset; for the temporal stream, we use the optical flow of UCF-101, with the optical flow images provided by the Graz University of Technology. The experimental results show that the proposed method outperforms the traditional two-stream CNN method by at least 3% in accuracy. The proposed model also achieves higher recognition accuracy in each of the spatial and temporal streams individually and, compared with state-of-the-art methods, still delivers the best recognition performance.
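The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of the two-stream design it outlines: an LSTM over per-frame CNN features in the spatial stream, and a DenseNet over stacked optical flow in the temporal stream. The backbone choice (ResNet-18 frame features), the hidden size, the number of stacked flow fields, and score-averaging fusion are illustrative assumptions, not the paper's reported configuration.

```python
# Sketch of a two-stream model: LSTM-based spatial stream + DenseNet
# temporal stream, fused by averaging class scores. Hyperparameters and
# backbones here are assumptions for illustration only.
import torch
import torch.nn as nn
import torchvision.models as models


class SpatialLSTMStream(nn.Module):
    """CNN + LSTM over RGB frames: spatial and temporal features."""

    def __init__(self, num_classes: int, hidden_size: int = 256):
        super().__init__()
        backbone = models.resnet18(weights=None)  # assumed feature extractor
        backbone.fc = nn.Identity()               # keep 512-d frame features
        self.backbone = backbone
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size,
                            batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) -> per-frame features -> LSTM -> logits
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))   # (B*T, 512)
        feats = feats.view(b, t, -1)                  # (B, T, 512)
        _, (h_n, _) = self.lstm(feats)
        return self.classifier(h_n[-1])               # (B, num_classes)


class TemporalDenseNetStream(nn.Module):
    """DenseNet over a stack of L horizontal/vertical flow fields."""

    def __init__(self, num_classes: int, flow_stack: int = 10):
        super().__init__()
        net = models.densenet121(weights=None)
        # First conv accepts 2*L flow channels (x- and y-displacements).
        net.features.conv0 = nn.Conv2d(2 * flow_stack, 64, kernel_size=7,
                                       stride=2, padding=3, bias=False)
        net.classifier = nn.Linear(net.classifier.in_features, num_classes)
        self.net = net

    def forward(self, flow: torch.Tensor) -> torch.Tensor:
        # flow: (B, 2*L, H, W)
        return self.net(flow)


class TwoStreamModel(nn.Module):
    def __init__(self, num_classes: int = 101):  # 101 classes in UCF-101
        super().__init__()
        self.spatial = SpatialLSTMStream(num_classes)
        self.temporal = TemporalDenseNetStream(num_classes)

    def forward(self, frames, flow):
        # Late fusion by averaging softmax scores (one common choice).
        p_s = self.spatial(frames).softmax(dim=1)
        p_t = self.temporal(flow).softmax(dim=1)
        return (p_s + p_t) / 2


if __name__ == "__main__":
    model = TwoStreamModel()
    rgb = torch.randn(2, 16, 3, 224, 224)  # 16 RGB frames per clip
    flow = torch.randn(2, 20, 224, 224)    # 10 stacked flow fields (x, y)
    print(model(rgb, flow).shape)          # torch.Size([2, 101])
```

Averaging the two streams' softmax scores is a common late-fusion choice for two-stream networks; the paper may weight or fuse the streams differently.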
