
Towards High Performance Human Keypoint Detection

Published in: International Journal of Computer Vision

Abstract

Human keypoint detection from a single image is highly challenging due to occlusion, blur, illumination changes, and scale variance. In this paper, we address this problem from three aspects: devising an efficient network structure, proposing three effective training strategies, and exploiting four useful postprocessing techniques. First, we find that context information plays an important role in reasoning about human body configuration and inferring invisible keypoints. Inspired by this, we propose a cascaded context mixer (CCM), which efficiently integrates spatial and channel context information and progressively refines them. Second, to maximize CCM’s representation capability, we develop a hard-negative person detection mining strategy and a joint-training strategy that exploits abundant unlabeled data. These strategies enable CCM to learn discriminative features from a large and diverse set of poses. Third, we present several sub-pixel refinement techniques for postprocessing keypoint predictions to improve detection accuracy. Extensive experiments on the MS COCO keypoint detection benchmark demonstrate the superiority of the proposed method over representative state-of-the-art (SOTA) methods. Our single model achieves performance comparable to the winner of the 2018 COCO Keypoint Detection Challenge. The final ensemble model sets a new SOTA on this benchmark. The source code will be released at https://github.com/chaimi2013/CCM.
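The abstract compresses three technical ingredients: a cascaded context mixer (CCM), training strategies, and sub-pixel postprocessing. The two Python sketches below are hypothetical illustrations of the first and last of these, written against generic PyTorch/NumPy conventions rather than the authors' released code. The first shows one minimal way to gate a feature map by channel and spatial context; the paper's CCM cascades and progressively refines several such context stages, which this sketch does not attempt to reproduce.

```python
import torch
import torch.nn as nn


class ContextMixer(nn.Module):
    """Minimal channel + spatial context gating block (illustrative sketch only)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel context: squeeze-and-excitation style global gating.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial context: a single attention map over locations.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)   # reweight channels
        x = x * self.spatial_gate(x)   # reweight spatial locations
        return x
```

The second sketch shows one common sub-pixel decoding trick: shift the integer argmax of each keypoint heatmap by a quarter pixel toward the larger neighbouring response before mapping back to image coordinates. The paper presents several refinement techniques; this is only a generic variant given for context.

```python
import numpy as np


def refine_keypoints(heatmaps: np.ndarray) -> np.ndarray:
    """Decode (K, H, W) heatmaps into (K, 3) rows of (x, y, score),
    shifting each argmax by a quarter pixel toward the higher neighbour."""
    K, H, W = heatmaps.shape
    results = np.zeros((K, 3), dtype=np.float32)
    for k in range(K):
        hm = heatmaps[k]
        y, x = divmod(int(np.argmax(hm)), W)
        px, py = float(x), float(y)
        if 0 < x < W - 1:
            px += 0.25 * np.sign(hm[y, x + 1] - hm[y, x - 1])
        if 0 < y < H - 1:
            py += 0.25 * np.sign(hm[y + 1, x] - hm[y - 1, x])
        results[k] = (px, py, hm[y, x])
    return results
```

For a 17-keypoint model predicting heatmaps at a quarter of the input resolution, the decoded coordinates would then be multiplied by the corresponding stride to map them back onto the original image.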

Notes

  1. Only instances with at least one annotated keypoint are counted.

  2. https://challenger.ai/competition/keypoint/.

  3. A video demo can be found at https://github.com/chaimi2013/CCM/video.

  4. http://cocodataset.org/index.htm#keypoints-leaderboard.


Author information

Corresponding author

Correspondence to Dacheng Tao.

Additional information

Communicated by Karteek Alahari.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by Australian Research Council Projects FL-170100117, DP-180103424, and IH-180100002.


About this article

Cite this article

Zhang, J., Chen, Z. & Tao, D. Towards High Performance Human Keypoint Detection. Int J Comput Vis 129, 2639–2662 (2021). https://doi.org/10.1007/s11263-021-01482-8

