Skip to main content
Log in

Attention Refined Network for Human Pose Estimation

  • Published:
Neural Processing Letters Aims and scope Submit manuscript

Abstract

Recently, multi-scale feature fusion has been considered as one of the most important issues in designing convolutional neural networks (CNNs). However, most existing methods directly add the corresponding layers together without considering the semantic gaps between them, which may lead to inadequately feature fusion results. In this paper, we propose an attention refined network (HR-ARNet) to enhance multi-scale feature fusion for human pose estimation. The HR-ARNet employs channel and spatial attention mechanisms to reinforce important features and suppress unnecessary ones. To tackle the problem of inconsistent among keypoints, we utilize self-attention strategy to model long-range keypoints dependencies. We also propose to use the focus loss, which modifies the commonly used square error loss function to let it mainly focus on top K ‘hard’ keypoints during training. Focus loss selects ‘hard’ keypoints based on the training loss and only backpropagates the gradients from the selected keypoints. Experiments on human pose estimation benchmark, MPII Human Pose Dataset and COCO Keypoint Dataset, show that our method can boost the performance of state-of-the-art human pose estimation networks including HRNet (high-resolution net) (Sun et al., Proceedings of the IEEE conference on computer vision and pattern recognition, 2019). The code and models are available at: http://github/tongjiangwei/ARNet.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6

Similar content being viewed by others

References

  1. Pfister T, Charles J, Zisserman A (2015) Flowing convnets for human pose estimation in videos. In: Proceedings of the IEEE international conference on computer vision, pp 1913–1921

  2. Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition.. In: 32nd AAAI conference on artificial intelligence

  3. Newell A, Yang K, Deng J (2016) Stacked hourglass networks for human pose estimation. In: European conference on computer vision. Springer, Cham, pp 483–499

  4. Xiao B, Wu H, Wei Y (2018) Simple baselines for human pose estimation and tracking. In: Proceedings of the European conference on computer vision (ECCV), pp 466–481

  5. Yang W, Li S, Ouyang W, Li H, Wang X (2017) Learning feature pyramids for human pose estimation. In: proceedings of the IEEE international conference on computer vision, pp 1281–1290

  6. Sun K, Xiao B, Liu D, Wang J (2019) Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5693–5703

  7. Zhang H, Ouyang H, Liu S, Qi X, Shen X, Yang R, Jia J (2019) Human pose estimation with spatial contextual information. ArXiv preprint arXiv:1901.01760

  8. Zhang H, Goodfellow I, Metaxas D, Odena A (2018) Self-attention generative adversarial networks. ArXiv preprint arXiv:1805.08318

  9. Woo S, Park J, Lee JY, So Kweon I (2018) CBAM: convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp 3–19

  10. Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141

  11. Andriluka M, Pishchulin L, Gehler P, Schiele B (2014) 2D human pose estimation: New benchmark and state of the art analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3686–3693

  12. Lin TY et al (2014) Microsoft COCO: Common objects in context. In: European conference on computer vision. Springer, Cham, pp 740–755

  13. Andriluka M, Roth S, Schiele B (2009) Pictorial structures revisited: people detection and articulated pose estimation. In: 2009 IEEE conference on computer vision and pattern recognition, pp 1014–1021

  14. Chen X, Yuille AL (2014) Articulated pose estimation by a graphical model with image dependent pairwise relations. In: Advances in neural information processing systems, pp 1736–1744

  15. Toshev A, Szegedy C (2014) Deeppose: Human pose estimation via deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1653–1660

  16. Tompson JJ, Jain A, LeCun Y, Bregler C (2014) Joint training of a convolutional network and a graphical model for human pose estimation. In: Advances in neural information processing systems, pp 1799–1807

  17. Wei SE, Ramakrishna V, Kanade T, Sheikh Y (2016) Convolutional pose machines. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4724–4732

  18. Chou CJ, Chien JT, Chen HT (2018) Self adversarial training for human pose estimation. In: 2018 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC), pp 17–30

  19. Tang W, Yu P, Wu Y (2018) Deeply learned compositional models for human pose estimation. In: Proceedings of the European conference on computer vision (ECCV), pp 190–206

  20. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

  21. Chen Y, Wang Z, Peng Y, Zhang Z, Yu G, Sun J (2018) Cascaded pyramid network for multi-person pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7103–7112

  22. He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-CNN. In: Proceedings of the IEEE international conference on computer vision, pp 2961–2969

  23. Pishchulin L, Insafutdinov E, Tang S, Andres B, Andriluka M, Gehler PV, Schiele B (2016) Deepcut: joint subset partition and labeling for multi person pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4929–4937

  24. Insafutdinov E, Pishchulin L, Andres B, Andriluka M, Schiele B (2016) Deepercut: a deeper, stronger, and faster multi-person pose estimation model. In: European conference on computer vision. Springer, Cham, pp 34–50

  25. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. ArXiv preprint arXiv:1409.1556

  26. Cao Z, Simon T, Wei SE, Sheikh Y (2017) Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7291–7299

  27. Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. ArXiv preprint arXiv:1409.0473

  28. Zhang H, Xu T, Li H, Zhang S, Wang X, Huang X, Metaxas DN (2017) Stackgan: text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp 5907–5915

  29. Chen X, Mishra N, Rohaninejad M, Abbeel P (2017) Pixelsnail: an improved autoregressive generative model. ArXiv preprint arXiv:1712.09763

  30. Larochelle H, Hinton GE (2010) Learning to combine foveal glimpses with a third-order Boltzmann machine. Adv Neural Inf Process Syst 23:1243–1251

    Google Scholar 

  31. Xu K et al (2015) Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning, pp 2048–2057

  32. Gregor K, Danihelka I, Graves A, Rezende DJ, Wierstra D (2015) Draw: a recurrent neural network for image generation. ArXiv preprint arXiv:1502.04623

  33. Yang Z, He X, Gao J, Deng L, Smola A (2016) Stacked attention networks for image question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 21–29

  34. Vaswani A et al (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008

  35. Lin Z, Feng M, Santos CND, Yu M, Xiang B, Zhou B, Bengio Y (2017) A structured self-attentive sentence embedding. ArXiv preprint arXiv:1703.03130

  36. Wang F et al (2017) Residual attention network for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164

  37. Woo S, Hwang S, Kweon IS (2018) Stairnet: top-down semantic aggregation for accurate one shot detection. In: 2018 IEEE winter conference on applications of computer vision (WACV), pp 1093–1102

  38. Cheng J, Dong L, Lapata M (2016) Long short-term memory-networks for machine reading. ArXiv preprint arXiv:1601.06733

  39. Parikh AP, Täckström, O, Das D, Uszkoreit J (2016) A decomposable attention model for natural language inference. ArXiv preprint arXiv:1606.01933

  40. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. ArXiv preprint arXiv:1412.6980

  41. Johnson S, Everingham M (2010) Clustered pose and nonlinear appearance models for human pose estimation. BMVC 2(4):5

    Google Scholar 

  42. Wu J et al (2017) Ai challenger: a large-scale dataset for going deeper in image understanding. ArXiv preprint arXiv:1711.06475

  43. Newell A, Huang Z, Deng J (2017) Associative embedding: end-to-end learning for joint detection and grouping. In: Advances in neural information processing systems, pp 2277–2287

  44. Papandreou G, Zhu T, Chen LC, Gidaris S, Tompson J, Murphy K (2018) Personlab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In: Proceedings of the European conference on computer vision (ECCV), pp 269–286

  45. Kocabas M, Karagoz S, Akbas E (2018) Multiposenet: fast multi-person pose estimation using pose residual network. In: Proceedings of the European conference on computer vision (ECCV), pp 417–433

  46. Papandreou G, Zhu T, Kanazawa N, Toshev A, Tompson J, Bregler C, Murphy K (2017) Towards accurate multi-person pose estimation in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4903–4911

  47. Sun X, Xiao B, Wei F, Liang S, Wei Y (2018) Integral human pose regression. In: Proceedings of the European conference on computer vision (ECCV), pp 529–545

  48. Fang HS, Xie S, Tai YW, Lu C (2017) “Rmpe: Regional multi-person pose estimation. In: Proceedings of the IEEE international conference on computer vision, pp 2334–2343

  49. Huang S, Gong M, Tao D (2017) A coarse-fine network for keypoint localization. In: Proceedings of the IEEE international conference on computer vision, pp 3028–3037

  50. Bulat A, Tzimiropoulos G (2016) Human pose estimation via convolutional part heatmap regression. In: European conference on computer vision. Springer, Cham, pp 717–732

  51. Sun K, Lan C, Xing J, Zeng W, Liu D, Wang J (2017) Human pose estimation using global and local normalization. In: Proceedings of the IEEE international conference on computer vision, pp 5599–5607

  52. Tang Z, Peng X, Geng S, Wu L, Zhang S, Metaxas D (2018) Quantized densely connected u-nets for efficient landmark localization. In: Proceedings of the European conference on computer vision (ECCV), pp 339–354

  53. Ning G, Zhang Z, He Z (2017) Knowledge-guided deep fractal neural networks for human pose estimation. IEEE Trans Multimed 20(5):1246–1259

    Article  Google Scholar 

  54. Luvizon DC, Tabia H, Picard D (2019) Human pose regression by combining indirect part detection and contextual information. Comput Graph 85:15–22

    Article  Google Scholar 

  55. Chu X, Yang W, Ouyang W, Ma C, Yuille AL, Wang X (2017) Multi-context attention for human pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1831–1840

  56. Available https://github.com/leoxiaobin/deep-high-resolution-net.pytorch

  57. Peng C, Zhang X, Yu G, Luo G, Sun J (2017) Large kernel matters–improve semantic segmentation by global convolutional network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4353–4361

  58. Wang X, Bo L, Fuxin L (2019) Adaptive wing loss for robust face alignment via heatmap regression. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6971–6981

Download references

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61771299.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rui Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Wang, X., Tong, J. & Wang, R. Attention Refined Network for Human Pose Estimation. Neural Process Lett 53, 2853–2872 (2021). https://doi.org/10.1007/s11063-021-10523-9

Download citation

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11063-021-10523-9

Keywords

Navigation