Dual Networks Based 3D Multi-Person Pose Estimation From Monocular Video
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8), Pub Date: 2022-04-26, DOI: 10.1109/tpami.2022.3170353
Yu Cheng, Bo Wang, Robby T. Tan

Monocular 3D human pose estimation has made progress in recent years. Most methods focus on a single person and estimate poses in person-centric coordinates, i.e., coordinates based on the center of the target person. Hence, these methods are inapplicable to multi-person 3D pose estimation, where absolute coordinates (e.g., camera coordinates) are required. Moreover, multi-person pose estimation is more challenging than single-person pose estimation, due to inter-person occlusion and close human interactions. Existing top-down multi-person methods rely on human detection and thus suffer from detection errors, failing to produce reliable pose estimates in multi-person scenes. Meanwhile, existing bottom-up methods do not use human detection and are therefore unaffected by detection errors, but because they process all persons in a scene at once, they are prone to errors, particularly for persons at small scales. To address these challenges, we propose integrating the top-down and bottom-up approaches to exploit their respective strengths. Our top-down network estimates human joints for all persons in an image patch rather than only one, making it robust to potentially erroneous bounding boxes. Our bottom-up network incorporates human-detection-based normalized heatmaps, making it more robust to scale variations. Finally, the 3D poses estimated by the top-down and bottom-up networks are fed into our integration network to produce the final 3D poses. To address the common gap between training and testing data, we perform test-time optimization, refining the estimated 3D human poses using a high-order temporal constraint, a re-projection loss, and bone-length regularization. We also introduce a two-person pose discriminator that enforces natural two-person interactions. Finally, we apply a semi-supervised method to overcome the scarcity of 3D ground-truth data. Our evaluations demonstrate the effectiveness of the proposed method and its individual components. Our code and pretrained models are publicly available: https://github.com/3dpose/3D-Multi-Person-Pose.
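The abstract describes a test-time optimization that refines the estimated 3D poses with a high-order temporal constraint, a re-projection loss, and bone-length regularization. The sketch below illustrates what such a refinement step could look like in PyTorch; it is not the authors' implementation, and the skeleton edges, camera model, loss weights, and function names are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): gradient-based test-time
# refinement of a 3D pose sequence using re-projection, bone-length, and
# temporal-smoothness terms.
import torch

# Hypothetical skeleton edges (parent, child) used for the bone-length term.
BONES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6)]

def reprojection_loss(pose3d, pose2d, K):
    """Project 3D joints with intrinsics K and compare to 2D estimates.
    pose3d: (T, J, 3) camera-coordinate joints, pose2d: (T, J, 2), K: (3, 3)."""
    proj = pose3d @ K.T                                   # (T, J, 3)
    proj2d = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)
    return (proj2d - pose2d).norm(dim=-1).mean()

def bone_length_loss(pose3d):
    """Penalize bone-length variation over time (lengths should stay constant)."""
    lengths = torch.stack(
        [(pose3d[:, c] - pose3d[:, p]).norm(dim=-1) for p, c in BONES], dim=-1
    )                                                     # (T, num_bones)
    return lengths.var(dim=0).mean()

def temporal_loss(pose3d):
    """Second-order (acceleration) smoothness as a simple high-order temporal term."""
    accel = pose3d[2:] - 2 * pose3d[1:-1] + pose3d[:-2]
    return accel.norm(dim=-1).mean()

def refine(pose3d_init, pose2d, K, steps=100, lr=1e-2, w=(1.0, 0.1, 0.1)):
    """Refine an initial 3D pose sequence at test time by minimizing the three terms."""
    pose3d = pose3d_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose3d], lr=lr)
    for _ in range(steps):
        loss = (w[0] * reprojection_loss(pose3d, pose2d, K)
                + w[1] * bone_length_loss(pose3d)
                + w[2] * temporal_loss(pose3d))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pose3d.detach()
```

The loss weights and optimizer settings here are placeholders; the paper's actual constraint formulation and optimization schedule may differ.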

Updated: 2024-08-28