
Towards High Performance Human Keypoint Detection

Published in: International Journal of Computer Vision

Abstract

Human keypoint detection from a single image is highly challenging due to occlusion, blur, illumination changes, and scale variance. In this paper, we address this problem from three aspects: devising an efficient network structure, proposing three effective training strategies, and exploiting four useful postprocessing techniques. First, we find that context information plays an important role in reasoning about human body configuration and inferring invisible keypoints. Inspired by this, we propose a cascaded context mixer (CCM), which efficiently integrates spatial and channel context information and progressively refines them. Second, to maximize CCM’s representation capability, we develop a hard-negative person detection mining strategy and a joint-training strategy that exploits abundant unlabeled data. These strategies enable CCM to learn discriminative features from a large and diverse set of poses. Third, we present several sub-pixel refinement techniques for postprocessing keypoint predictions to improve detection accuracy. Extensive experiments on the MS COCO keypoint detection benchmark demonstrate the superiority of the proposed method over representative state-of-the-art (SOTA) methods. Our single model achieves performance comparable to the winner of the 2018 COCO Keypoint Detection Challenge. The final ensemble model sets a new SOTA on this benchmark. The source code will be released at https://github.com/chaimi2013/CCM.
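The abstract compresses three technical ingredients: a cascaded context mixer (CCM), training strategies, and sub-pixel postprocessing. The two Python sketches below are hypothetical illustrations of the first and last of these, written against generic PyTorch/NumPy conventions rather than the authors' released code. The first shows one minimal way to gate a feature map by channel and spatial context; the paper's CCM cascades and progressively refines several such context stages, which this sketch does not attempt to reproduce.

```python
import torch
import torch.nn as nn


class ContextMixer(nn.Module):
    """Minimal channel + spatial context gating block (illustrative sketch only)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel context: squeeze-and-excitation style global gating.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial context: a single attention map over locations.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)   # reweight channels
        x = x * self.spatial_gate(x)   # reweight spatial locations
        return x
```

The second sketch shows one common sub-pixel decoding trick: shift the integer argmax of each keypoint heatmap by a quarter pixel toward the larger neighbouring response before mapping back to image coordinates. The paper presents several refinement techniques; this is only a generic variant given for context.

```python
import numpy as np


def refine_keypoints(heatmaps: np.ndarray) -> np.ndarray:
    """Decode (K, H, W) heatmaps into (K, 3) rows of (x, y, score),
    shifting each argmax by a quarter pixel toward the higher neighbour."""
    K, H, W = heatmaps.shape
    results = np.zeros((K, 3), dtype=np.float32)
    for k in range(K):
        hm = heatmaps[k]
        y, x = divmod(int(np.argmax(hm)), W)
        px, py = float(x), float(y)
        if 0 < x < W - 1:
            px += 0.25 * np.sign(hm[y, x + 1] - hm[y, x - 1])
        if 0 < y < H - 1:
            py += 0.25 * np.sign(hm[y + 1, x] - hm[y - 1, x])
        results[k] = (px, py, hm[y, x])
    return results
```

For a 17-keypoint model predicting heatmaps at a quarter of the input resolution, the decoded coordinates would then be multiplied by the corresponding stride to map them back onto the original image.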

Notes

  1. Only instances with at least one annotated keypoint are counted.

  2. https://challenger.ai/competition/keypoint/.

  3. A video demo can be found at https://github.com/chaimi2013/CCM/video.

  4. http://cocodataset.org/index.htm#keypoints-leaderboard.


Author information

Corresponding author

Correspondence to Dacheng Tao.

Additional information

Communicated by Karteek Alahari.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by Australian Research Council Projects FL-170100117, DP-180103424, and IH-180100002.


About this article

Cite this article

Zhang, J., Chen, Z. & Tao, D. Towards High Performance Human Keypoint Detection. Int J Comput Vis 129, 2639–2662 (2021). https://doi.org/10.1007/s11263-021-01482-8

