Multi-Task Deep Learning for Real-Time 3D Human Pose Estimation and Action Recognition
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 23.6), Pub Date: 2020-02-24, DOI: 10.1109/tpami.2020.2976014
Diogo Luvizon, David Picard, Hedi Tabia

Human pose estimation and action recognition are related tasks, since both problems depend strongly on the representation and analysis of the human body. Nonetheless, most recent methods in the literature handle the two problems separately. In this article, we propose a multi-task framework for jointly estimating 2D or 3D human poses from monocular color images and classifying human actions from video sequences. We show that a single architecture can solve both problems efficiently and still achieve state-of-the-art or comparable results on each task while running at a throughput of more than 100 frames per second. The proposed method benefits from a high degree of parameter sharing between the two tasks by unifying the processing of still images and video clips in a single pipeline, allowing the model to be trained on data from different categories simultaneously and seamlessly. Additionally, we provide important insights for training the proposed multi-task model end-to-end by decoupling key prediction parts, which consistently leads to better accuracy on both tasks. The reported results on four datasets (MPII, Human3.6M, Penn Action and NTU RGB+D) demonstrate the effectiveness of our method on the targeted tasks. Our source code and trained weights are publicly available at https://github.com/dluvizon/deephar.
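
To make the architectural idea concrete, the following is a minimal PyTorch-style sketch, not the authors' released implementation (that code is at the repository linked above). It illustrates the core multi-task pattern the abstract describes: a shared backbone, a pose head whose heatmaps are converted to coordinates with a differentiable soft-argmax (so pose prediction stays end-to-end trainable), and an action head that classifies a clip from pose features pooled over time. All class names, layer sizes, and hyperparameters here are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class MultiTaskPoseAction(nn.Module):
    """Hypothetical sketch: one shared backbone feeding a pose head and an action head."""

    def __init__(self, num_joints=16, num_actions=60, feat_dim=128):
        super().__init__()
        # Shared convolutional feature extractor (stand-in for the paper's backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Pose head: one heatmap per joint, decoded to (x, y) by soft-argmax.
        self.pose_head = nn.Conv2d(feat_dim, num_joints, kernel_size=1)
        # Action head: classifies a clip from pose coordinates pooled over time.
        self.action_head = nn.Linear(num_joints * 2, num_actions)

    @staticmethod
    def soft_argmax(heatmaps):
        # Differentiable argmax: softmax-normalize each heatmap, then take the
        # expected (x, y) coordinate under that distribution.
        b, j, h, w = heatmaps.shape
        probs = heatmaps.flatten(2).softmax(dim=-1).view(b, j, h, w)
        ys = torch.linspace(0, 1, h, device=heatmaps.device)
        xs = torch.linspace(0, 1, w, device=heatmaps.device)
        y = (probs.sum(dim=3) * ys).sum(dim=2)  # expected row per joint
        x = (probs.sum(dim=2) * xs).sum(dim=2)  # expected column per joint
        return torch.stack([x, y], dim=-1)      # (batch, joints, 2)

    def forward(self, clip):
        # clip: (batch, time, 3, H, W); a still image is just a clip of length 1,
        # which is how one pipeline can serve both image and video data.
        b, t = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1))         # (b*t, C, h, w)
        poses = self.soft_argmax(self.pose_head(feats))   # (b*t, J, 2)
        poses = poses.view(b, t, -1)                      # (b, t, J*2)
        action_logits = self.action_head(poses.mean(dim=1))  # pool over time
        return poses.view(b, t, -1, 2), action_logits

model = MultiTaskPoseAction()
clip = torch.randn(2, 8, 3, 128, 128)  # two clips of 8 RGB frames each
poses, action_logits = model(clip)     # poses: (2, 8, 16, 2); logits: (2, 60)
```

Because the action head consumes the pose head's output, gradients from the action loss flow back through the pose estimator, which is the kind of coupling the paper's "decoupling key prediction parts" insight is about managing during end-to-end training.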

Updated: 2020-02-24