A Unified Deep Framework for Joint 3D Pose Estimation and Action Recognition from a Single RGB Camera.
Sensors (IF 3.4), Pub Date: 2020-03-25, DOI: 10.3390/s20071825
Huy Hieu Pham 1,2,3, Houssam Salmane 4, Louahdi Khoudour 1, Alain Crouzil 2, Sergio A. Velastin 5,6,7, and Pablo Zegers 8
We present a deep learning-based multitask framework for joint 3D human pose estimation and action recognition from RGB sensors using simple cameras. The approach proceeds along two stages. In the first, a real-time 2D pose detector is run to determine the precise pixel location of important keypoints of the human body. A two-stream deep neural network is then designed and trained to map detected 2D keypoints into 3D poses. In the second stage, the Efficient Neural Architecture Search (ENAS) algorithm is deployed to find an optimal network architecture that is used for modeling the spatio-temporal evolution of the estimated 3D poses via an image-based intermediate representation and performing action recognition. Experiments on Human3.6M, MSR Action3D and SBU Kinect Interaction datasets verify the effectiveness of the proposed method on the targeted tasks. Moreover, we show that the method requires a low computational budget for training and inference. In particular, the experimental results show that by using a monocular RGB sensor, we can develop a 3D pose estimation and human action recognition approach that reaches the performance of RGB-depth sensors. This opens up many opportunities for leveraging RGB cameras (which are much cheaper than depth cameras and extensively deployed in private and public places) to build intelligent recognition systems.
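The two-stage pipeline above can be sketched in code. This is a hedged illustration, not the paper's implementation: the lifting network and the ENAS-found classifier are replaced by placeholders, and the function names, joint count, and the exact image-based encoding of the 3D pose sequence are assumptions for illustration only.

```python
import numpy as np

def lift_2d_to_3d(kpts_2d: np.ndarray) -> np.ndarray:
    """Placeholder for Stage 1's two-stream 2D-to-3D lifting network.

    kpts_2d: (T, J, 2) pixel coordinates over T frames and J joints,
    as produced by an off-the-shelf real-time 2D pose detector.
    Returns (T, J, 3). Here we simply append a dummy depth channel;
    the paper trains a deep network to regress real depth.
    """
    depth = np.zeros(kpts_2d.shape[:2] + (1,))
    return np.concatenate([kpts_2d, depth], axis=-1)

def encode_pose_sequence_as_image(poses_3d: np.ndarray) -> np.ndarray:
    """Illustrative image-based intermediate representation for Stage 2:
    map a (T, J, 3) pose sequence to a (J, T, 3) uint8 image, with
    rows = joints, columns = frames, channels = (x, y, z), and each
    channel min-max normalized to [0, 255]. A CNN (found by ENAS in
    the paper) would then classify this image into an action label.
    """
    img = poses_3d.transpose(1, 0, 2).astype(np.float64)
    mins = img.min(axis=(0, 1), keepdims=True)
    maxs = img.max(axis=(0, 1), keepdims=True)
    img = (img - mins) / np.maximum(maxs - mins, 1e-8) * 255.0
    return img.astype(np.uint8)

# Toy run: 16 frames of 15 random 2D keypoints in a 640-pixel-wide frame.
kpts = np.random.rand(16, 15, 2) * 640
poses = lift_2d_to_3d(kpts)
image = encode_pose_sequence_as_image(poses)
print(poses.shape, image.shape, image.dtype)
```

Encoding the pose sequence as a fixed-size image lets the action-recognition stage reuse standard 2D convolutional architectures, which is what makes the ENAS architecture search in the second stage applicable.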

Updated: 2020-03-26