Learning a Robust Part-Aware Monocular 3D Human Pose Estimator via Neural Architecture Search

International Journal of Computer Vision

Abstract

Although most existing monocular 3D human pose estimation methods achieve very competitive performance, they are limited by estimating heterogeneous human body parts with the same decoder architecture. In this work, we present an approach for building a part-aware 3D human pose estimator that better handles these heterogeneous body parts. Our proposed method consists of two learning stages: (1) searching for suitable decoder architectures for specific parts, and (2) training the part-aware 3D human pose estimator built with these optimized neural architectures. Consequently, our searched model is efficient and compact, and can automatically select a suitable decoder architecture to estimate each human body part. Compared with previous state-of-the-art models built on the ResNet-50 network, our method achieves better performance while reducing parameters by 64.4% and FLOPs (multiply-adds) by 8.5%. We validate the robustness and stability of our searched models through extensive ablation experiments. Our method advances state-of-the-art accuracy on both single-person and multi-person 3D human pose estimation benchmarks at an affordable computational cost.
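The two-stage structure described in the abstract can be sketched roughly as follows. This is only an illustration and not the authors' search algorithm: a brute-force pick of the lowest-error candidate stands in for the architecture search, training is omitted, and the names `search_decoders`, `build_estimator`, `candidates`, and `val_error` are hypothetical.

```python
from typing import Callable, Dict, List

Features = List[float]
Decoder = Callable[[Features], List[float]]


def search_decoders(
    parts: List[str],
    candidates: Dict[str, Decoder],
    val_error: Callable[[str, str], float],
) -> Dict[str, str]:
    """Stage 1 (stand-in): per part, keep the candidate with the lowest validation error."""
    return {
        part: min(candidates, key=lambda name, p=part: val_error(p, name))
        for part in parts
    }


def build_estimator(
    choice: Dict[str, str], candidates: Dict[str, Decoder]
) -> Callable[[Features], Dict[str, List[float]]]:
    """Stage 2 (structure only): route shared features through each part's chosen decoder."""

    def estimate(features: Features) -> Dict[str, List[float]]:
        return {part: candidates[name](features) for part, name in choice.items()}

    return estimate
```

The point of the sketch is the shape of the result: one shared feature input and a separately chosen decoder per body part, rather than a single decoder for all parts.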

Notes

  1. The default joint order is: pelvis, right hip, right knee, right ankle, left hip, left knee, left ankle, torso, neck, nose, head, left shoulder, left elbow, left wrist, right shoulder, right elbow, right wrist. We also manually add the thorax part to align with the MPII dataset (Andriluka et al. 2014). When we perform model evaluation, we always exclude the thorax part.
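The joint order in this note, with the thorax excluded at evaluation time, might be encoded as in the sketch below. It rests on two assumptions: the thorax is appended as the last joint (the note does not say where it sits), and the error is a generic mean per-joint Euclidean distance. The names `JOINT_ORDER`, `TRAIN_JOINTS`, `EVAL_INDICES`, and `mean_joint_error` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]

# Default joint order as listed in the note (17 joints).
JOINT_ORDER: List[str] = [
    "pelvis", "right_hip", "right_knee", "right_ankle",
    "left_hip", "left_knee", "left_ankle",
    "torso", "neck", "nose", "head",
    "left_shoulder", "left_elbow", "left_wrist",
    "right_shoulder", "right_elbow", "right_wrist",
]

# Assumption: the manually added thorax is appended last; the note does not
# say where it sits in the order.
TRAIN_JOINTS: List[str] = JOINT_ORDER + ["thorax"]
EVAL_INDICES: List[int] = [i for i, name in enumerate(TRAIN_JOINTS) if name != "thorax"]


def mean_joint_error(pred: Sequence[Point3D], target: Sequence[Point3D]) -> float:
    """Mean Euclidean distance over evaluated joints, skipping the thorax."""
    if len(pred) != len(TRAIN_JOINTS) or len(target) != len(TRAIN_JOINTS):
        raise ValueError("expected one 3D point per joint in TRAIN_JOINTS")
    dists = [math.dist(pred[i], target[i]) for i in EVAL_INDICES]
    return sum(dists) / len(dists)
```

Because the index list is derived from joint names, moving the thorax to a different position only requires changing `TRAIN_JOINTS`.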

References

  • Alldieck, T., Pons-Moll, G., Theobalt, C., & Magnor, M. (2019). Tex2shape: Detailed full human body geometry from a single image. In IEEE international conference on computer vision (ICCV).

  • Andriluka, M., Pishchulin, L., Gehler, P., & Schiele, B. (2014). 2d human pose estimation: New benchmark and state of the art analysis. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Baek, S., Kim, K. I., & Kim, T. K. (2018). Augmented skeleton space transfer for depth-based hand pose estimation. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Baker, B., Gupta, O., Naik, N., & Raskar, R. (2017). Designing neural network architectures using reinforcement learning. In International conference on learning representations (ICLR).

  • Belagiannis, V., Amin, S., Andriluka, M., Schiele, B., Navab, N., & Ilic, S. (2014). 3d pictorial structures for multiple human pose estimation. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., & Black, M. J. (2016). Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European conference on computer vision (ECCV).

  • Burenius, M., Sullivan, J., & Carlsson, S. (2013). 3d pictorial structures for multiple view articulated pose estimation. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Cai, Y., Ge, L., Liu, J., Cai, J., Cham, T. J., Yuan, J., & Thalmann, N. M. (2019b). Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In IEEE international conference on computer vision (ICCV).

  • Cai, H., Zhu, L., & Han, S. (2019a). Proxylessnas: Direct neural architecture search on target task and hardware. In International conference on learning representations (ICLR).

  • Chen, C. H., & Ramanan, D. (2017). 3d human pose estimation = 2d pose estimation + matching. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Chen, L. C., Collins, M., Zhu, Y., Papandreou, G., Zoph, B., Schroff, F., Adam, H., & Shlens, J. (2018). Searching for efficient multi-scale architectures for dense image prediction. In Conference on neural information processing systems (NeurIPS).

  • Chen, Z., Guo, Y., Huang, Y., & Liang, W. (2019b). Learning depth-aware heatmaps for 3d human pose estimation in the wild. In British machine vision conference (BMVC).

  • Chen, Z., Huang, Y., Yu, H., Xue, B., Han, K., Guo, Y., & Wang, L. (2020). Towards part-aware monocular 3d human pose estimation: An architecture search approach. In European conference on computer vision (ECCV).

  • Chen, Y., Yang, T., Zhang, X., Meng, G., Xiao, X., & Sun, J. (2019a). Detnas: Backbone search for object detection. In Conference on neural information processing systems (NeurIPS).

  • Ci, H., Wang, C., Ma, X., & Wang, Y. (2019). Optimizing network structure for 3d human pose estimation. In IEEE international conference on computer vision (ICCV).

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Divvala, S. K., Efros, A. A., & Hebert, M. (2012). How important are “deformable parts” in the deformable parts model? In European conference on computer vision (ECCV).

  • Fabbri, M., Lanzi, F., Calderara, S., Alletto, S., & Cucchiara, R. (2020). Compressed volumetric heatmaps for multi-person 3d pose estimation. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Fang, H., Xu, Y., Wang, W., Liu, X., & Zhu, S. C. (2018). Learning knowledge-guided pose grammar machine for 3d human pose estimation. In AAAI conference on artificial intelligence (AAAI).

  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2009). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 32, 1627–1645.

  • Ganapathi, V., Plagemann, C., Koller, D., & Thrun, S. (2010). Real time motion capture using a single time-of-flight camera. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Ghiasi, G., Lin, T. Y., & Le, Q. V. (2019). Nas-fpn: Learning scalable feature pyramid architecture for object detection. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Guo, Z., Zhang, X., Mu, H., Heng, W., Liu, Z., Wei, Y., & Sun, J. (2020). Single path one-shot neural architecture search with uniform sampling. In European conference on computer vision (ECCV).

  • Gupta, A., Martinez, J., Little, J. J., & Woodham, R. J. (2014). 3d pose from motion for cross-view action recognition via non-linear circulant temporal encoding. In IEEE conference on computer vision and pattern recognition (CVPR).

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In IEEE international conference on computer vision (ICCV).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Hossain, M. R. I., & Little, J. J. (2018). Exploiting temporal information for 3d human pose estimation. In European conference on computer vision (ECCV).

  • Howard, A., Sandler, M., Chu, G., Chen, L. C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., & Le, Q. V. (2019). Searching for mobilenetv3. In IEEE international conference on computer vision (ICCV).

  • Ionescu, C., Papava, D., Olaru, V., & Sminchisescu, C. (2014). Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. In IEEE transactions on pattern analysis and machine intelligence (TPAMI).

  • Iskakov, K., Burkov, E., Lempitsky, V., & Malkov, Y. (2019). Learnable triangulation of human pose. In IEEE international conference on computer vision (ICCV).

  • Jiang, H. (2010). 3d human pose reconstruction using millions of exemplars. In International conference on pattern recognition (ICPR).

  • Jiang, W., Kolotouros, N., Pavlakos, G., Zhou, X., & Daniilidis, K. (2020). Coherent reconstruction of multiple humans from a single image. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., Kanade, T., Nobuhara, S., & Sheikh, Y. (2015). Panoptic studio: A massively multiview system for social motion capture. In IEEE international conference on computer vision (ICCV).

  • Kanazawa, A., Black, M. J., Jacobs, D. W., & Malik, J. (2018). End-to-end recovery of human shape and pose. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. In International conference on learning representations (ICLR).

  • Kocabas, M., Karagoz, S., & Akbas, E. (2019). Self-supervised learning of 3d human pose using multi-view geometry. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Kolotouros, N., Pavlakos, G., & Daniilidis, K. (2019). Convolutional mesh regression for single-image human shape reconstruction. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Lee, H. J., & Chen, Z. (1985). Determination of 3d human body postures from a single view. In Computer vision, graphics, and image processing (CVGIP).

  • Li, S., & Chan, A. B. (2014). 3d human pose estimation from monocular images with deep convolutional neural network. In Asian conference on computer vision (ACCV).

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In European conference on computer vision (ECCV).

  • Liu, C., Chen, L. C., Schroff, F., Adam, H., Hua, W., Yuille, A. L., & Fei-Fei, L. (2019a). Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Liu, H., Simonyan, K., & Yang, Y. (2019b). Darts: Differentiable architecture search. In International conference on learning representations (ICLR).

  • Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. In International conference on learning representations (ICLR).

  • MacKay, D. J. (2003). Information theory, inference and learning algorithms. Cambridge University Press.

  • Martinez, J., Hossain, R., Romero, J., & Little, J. J. (2017). A simple yet effective baseline for 3d human pose estimation. In IEEE international conference on computer vision (ICCV).

  • Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., & Theobalt, C. (2017). Monocular 3d human pose estimation in the wild using improved cnn supervision. In International conference on 3D vision (3DV).

  • Mehta, D., Sotnychenko, O., Mueller, F., Xu, W., Sridhar, S., Pons-Moll, G., & Theobalt, C. (2018). Single-shot multi-person 3d pose estimation from monocular rgb. In International conference on 3D vision (3DV).

  • Moon, G., Chang, J. Y., & Lee, K. M. (2019). Camera distance-aware top-down approach for 3d multi-person pose estimation from a single rgb image. In IEEE international conference on computer vision (ICCV).

  • Moreno-Noguer, F. (2017). 3d human pose estimation from a single image via distance matrix regression. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Mueller, F., Davis, M., Bernard, F., Sotnychenko, O., Verschoor, M., Otaduy, M. A., et al. (2019). Real-time pose and shape reconstruction of two interacting hands with a single depth camera. ACM Transactions on Graphics (TOG) 38, 1–13.

  • Natsume, R., Saito, S., Huang, Z., Chen, W., Ma, C., Li, H., & Morishima, S. (2019). Siclope: Silhouette-based clothed people. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose estimation. In European conference on computer vision (ECCV).

  • Nibali, A., He, Z., Morgan, S., & Prendergast, L. (2019). 3d human pose estimation with 2d marginal heatmaps. In IEEE winter conference on applications of computer vision (WACV).

  • Omran, M., Lassner, C., Pons-Moll, G., Gehler, P., & Schiele, B. (2018). Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In International conference on 3D vision (3DV).

  • Park, S., Hwang, J., & Kwak, N. (2016). 3d human pose estimation using convolutional neural networks with 2d pose information. In European conference on computer vision (ECCV).

  • Pavlakos, G., Zhou, X., & Daniilidis, K. (2018). Ordinal depth supervision for 3d human pose estimation. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Pavlakos, G., Zhou, X., Derpanis, K. G., & Daniilidis, K. (2017a). Coarse-to-fine volumetric prediction for single-image 3d human pose. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Pavlakos, G., Zhou, X., Derpanis, K. G., & Daniilidis, K. (2017b). Harvesting multiple views for marker-less 3d human pose annotations. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research (JMLR) 12, 2825–2830.

  • Peng, J., Sun, M., Zhang, Z., Tan, T., & Yan, J. (2019). Efficient neural architecture transformation search in channel-level for object detection. In Conference on neural information processing systems (NeurIPS).

  • Qiu, H., Wang, C., Wang, J., Wang, N., & Zeng, W. (2019). Cross view fusion for 3d human pose estimation. In IEEE international conference on computer vision (ICCV).

  • Rhodin, H., Spörri, J., Katircioglu, I., Constantin, V., Meyer, F., Müller, E., Salzmann, M., & Fua, P. (2018). Learning monocular 3d human pose estimation from multi-view images. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Rogez, G., & Schmid, C. (2016). Mocap-guided data augmentation for 3d pose estimation in the wild. In Conference on neural information processing systems (NeurIPS).

  • Rogez, G., Weinzaepfel, P., & Schmid, C. (2017). Lcr-net: Localization-classification-regression for human pose. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Romero, J., Tzionas, D., & Black, M. J. (2017). Embodied hands: Modeling and capturing hands and bodies together. In ACM transactions on graphics (TOG).

  • Sárándi, I., Linder, T., Arras, K. O., & Leibe, B. (2018). Synthetic occlusion augmentation with volumetric heatmaps for the 2018 eccv posetrack challenge on 3d human pose estimation. In Workshop at european conference on computer vision (ECCVW).

  • Sárándi, I., Linder, T., Arras, K., & Leibe, B. (2020). Metric-scale truncation-robust heatmaps for 3d human pose estimation. In IEEE international conference on automatic face and gesture recognition (FG).

  • Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., & Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Sun, X., Shang, J., Liang, S., & Wei, Y. (2017). Compositional human pose regression. In IEEE international conference on computer vision (ICCV).

  • Sun, K., Xiao, B., Liu, D., & Wang, J. (2019). Deep high-resolution representation learning for human pose estimation. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Sun, X., Xiao, B., Wei, F., Liang, S., & Wei, Y. (2018). Integral human pose regression. In European conference on computer vision (ECCV).

  • Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., & Le, Q. V. (2019). Mnasnet: Platform-aware neural architecture search for mobile. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Tang, W., & Wu, Y. (2019). Does learning specific features for related parts help human pose estimation? In IEEE conference on computer vision and pattern recognition (CVPR).

  • Tang, W., Yu, P., & Wu, Y. (2018). Deeply learned compositional models for human pose estimation. In European conference on computer vision (ECCV).

  • Taylor, C. J. (2000). Reconstruction of articulated objects from point correspondences in a single uncalibrated image. In Computer vision and image understanding (CVIU).

  • Tome, D., Russell, C., & Agapito, L. (2017). Lifting from the deep: Convolutional 3d pose estimation from a single image. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Tu, H., Wang, C., & Zeng, W. (2020). Voxelpose: Towards multi-camera 3d human pose estimation in wild environment. In European conference on computer vision (ECCV).

  • Varol, G., Ceylan, D., Russell, B., Yang, J., Yumer, E., Laptev, I., & Schmid, C. (2018). Bodynet: Volumetric inference of 3d human body shapes. In European conference on computer vision (ECCV).

  • Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M. J., Laptev, I., & Schmid, C. (2017). Learning from synthetic humans. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Wang, J., Huang, S., Wang, X., & Tao, D. (2019). Not all parts are created equal: 3d pose estimation by modeling bi-directional dependencies of body parts. In IEEE international conference on computer vision (ICCV).

  • Wei, S. E., Ramakrishna, V., Kanade, T., & Sheikh, Y. (2016). Convolutional pose machines. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Wu, X., Finnegan, D., O’Neill, E., & Yang, Y. L. (2018). Handmap: Robust hand pose estimation via intermediate dense guidance map supervision. In European conference on computer vision (ECCV).

  • Xiong, F., Zhang, B., Xiao, Y., Cao, Z., Yu, T., Zhou, J. T., & Yuan, J. (2019). A2j: Anchor-to-joint regression network for 3d articulated pose estimation from a single depth image. In IEEE international conference on computer vision (ICCV).

  • Xu, Y., Xie, L., Zhang, X., Chen, X., Qi, G. J., Tian, Q., & Xiong, H. (2020). Pc-darts: Partial channel connections for memory-efficient architecture search. In International conference on learning representations (ICLR).

  • Yang, W., Ouyang, W., Wang, X., Ren, J., Li, H., & Wang, X. (2018). 3d human pose estimation in the wild by adversarial learning. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Yasin, H., Iqbal, U., Kruger, B., Weber, A., & Gall, J. (2016). A dual-source approach for 3d pose estimation from a single image. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Zhang, Y., Qiu, Z., Liu, J., Yao, T., Liu, D., & Mei, T. (2019). Customizable architecture search for semantic segmentation. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Zheng, Z., Yu, T., Wei, Y., Dai, Q., & Liu, Y. (2019). Deephuman: 3d human reconstruction from a single image. In IEEE international conference on computer vision (ICCV).

  • Zhou, K., Han, X., Jiang, N., Jia, K., & Lu, J. (2019). Hemlets pose: Learning part-centric heatmap triplets for accurate 3d human pose estimation. In IEEE international conference on computer vision (ICCV).

  • Zhou, X., Huang, Q., Sun, X., Xue, X., & Wei, Y. (2017). Towards 3d human pose estimation in the wild: a weakly-supervised approach. In IEEE international conference on computer vision (ICCV).

  • Zhou, X., Zhu, M., Leonardos, S., Derpanis, K. G., & Daniilidis, K. (2016). Sparseness meets deepness: 3d human pose estimation from monocular video. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Zhou, X., Zhu, M., Pavlakos, G., Leonardos, S., Derpanis, K. G., & Daniilidis, K. (2018). Monocap: Monocular human motion capture using a cnn coupled with a geometric prior. In IEEE transactions on pattern analysis and machine intelligence (TPAMI).

  • Zoph, B., & Le, Q. V. (2017). Neural architecture search with reinforcement learning. In International conference on learning representations (ICLR).

Acknowledgements

This work was jointly supported by the National Key Research and Development Program of China (Grant No. 2018AAA0100400), the National Natural Science Foundation of China (61633021, 61721004, 61806194, U1803261, and 61976132), the Beijing Nova Program (Z201100006820079), the Shandong Provincial Key Research and Development Program (2019JZZY010119), the Key Research Program of Frontier Sciences, CAS (Grant No. ZDBS-LY-JSC032), and CAS-AIR.

Author information

Corresponding author

Correspondence to Liang Wang.

Additional information

Communicated by Gregory Rogez.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Chen, Z., Huang, Y., Yu, H. et al. Learning a Robust Part-Aware Monocular 3D Human Pose Estimator via Neural Architecture Search. Int J Comput Vis 130, 56–75 (2022). https://doi.org/10.1007/s11263-021-01525-0

  • DOI: https://doi.org/10.1007/s11263-021-01525-0