A deeply coupled ConvNet for human activity recognition using dynamic and RGB images
Neural Computing and Applications (IF 4.5), Pub Date: 2020-05-21, DOI: 10.1007/s00521-020-05018-y
Tej Singh, Dinesh Kumar Vishwakarma

This work is motivated by the tremendous achievements of deep learning models in computer vision tasks, particularly human activity recognition. The task is attracting growing attention owing to its many real-life applications, for example smart surveillance systems, human–computer interaction, sports action analysis, and elderly healthcare. The acquisition and interfacing of multimodal data have recently become straightforward thanks to low-cost depth devices. Several approaches have been developed based on RGB-D (depth) evidence, at the cost of additional equipment setup and high complexity. Conversely, methods that use only RGB frames deliver inferior performance because depth evidence is absent, but they require less hardware and are simple and easy to generalize using only color cameras. In this work, a deeply coupled ConvNet for human activity recognition is proposed that processes RGB frames in the top stream with a bidirectional long short-term memory (Bi-LSTM), while in the bottom stream a CNN is trained on a single dynamic motion image. For the RGB frames, the CNN-Bi-LSTM model is trained end to end to refine the features of the pre-trained CNN, while the dynamic-image stream fine-tunes the top layers of the pre-trained model to extract temporal information from videos. The features obtained from the two data streams are fused at the decision level, after the softmax layer, using different late-fusion techniques; the highest accuracy is achieved with max fusion. The model's performance is assessed on four standard single-person and multi-person activity RGB-D (depth) datasets. The highest classification accuracies achieved on these human action datasets are compared with similar state-of-the-art results and exceed them by significant margins: 2% on SBU Interaction, 4% on MIVIA Action, 1% on MSR Action Pair, and 4% on MSR Daily Activity.
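
The "single dynamic motion image" fed to the bottom stream is commonly obtained by rank pooling a video's frames into one RGB image. Since the abstract does not spell out the construction, below is a minimal sketch using the approximate rank-pooling coefficients of Bilen et al. (alpha_t = 2t - T - 1); the paper's exact procedure may differ.

import numpy as np

def dynamic_image(frames):
    # frames: (T, H, W, 3) array of RGB frames from one video clip.
    # Approximate rank-pooling coefficients (Bilen et al., 2016):
    # alpha_t = 2t - T - 1 for t = 1..T, weighting later frames more.
    T = frames.shape[0]
    alphas = 2 * np.arange(1, T + 1) - T - 1
    di = np.tensordot(alphas, frames.astype(np.float64), axes=(0, 0))
    # Rescale to [0, 255] so the result can be used like an ordinary image.
    di = (di - di.min()) / (di.max() - di.min() + 1e-8) * 255.0
    return di.astype(np.uint8)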

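A minimal sketch of the two streams described above, assuming a ResNet-18 backbone and a 512-unit Bi-LSTM (both assumptions; the abstract names neither the backbone nor the layer sizes).

import torch
import torch.nn as nn
import torchvision.models as models

class CNNBiLSTM(nn.Module):
    # RGB stream: per-frame CNN features refined by a Bi-LSTM.
    def __init__(self, num_classes, hidden=512):
        super().__init__()
        backbone = models.resnet18(weights="IMAGENET1K_V1")  # assumed backbone
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        self.lstm = nn.LSTM(512, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, clips):                # clips: (B, T, 3, 224, 224)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).flatten(1)  # (B*T, 512)
        seq, _ = self.lstm(feats.view(b, t, -1))          # (B, T, 2*hidden)
        return self.fc(seq[:, -1])                        # class logits

class DynamicImageCNN(nn.Module):
    # Dynamic-image stream: pre-trained CNN fine-tuned on single motion images.
    def __init__(self, num_classes):
        super().__init__()
        net = models.resnet18(weights="IMAGENET1K_V1")
        net.fc = nn.Linear(net.fc.in_features, num_classes)
        self.net = net

    def forward(self, imgs):                 # imgs: (B, 3, 224, 224)
        return self.net(imgs)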

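The decision-level fusion step combines the two streams' softmax scores element-wise; the abstract reports max fusion as the best-performing late-fusion rule. The function name fuse_max below is illustrative.

import torch

def fuse_max(logits_rgb, logits_dyn):
    # Late (decision-level) fusion after the softmax layer.
    p_rgb = torch.softmax(logits_rgb, dim=1)   # (B, num_classes)
    p_dyn = torch.softmax(logits_dyn, dim=1)
    fused = torch.maximum(p_rgb, p_dyn)        # element-wise max fusion
    return fused.argmax(dim=1)                 # predicted class per sample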
