
AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild

Published in: International Journal of Computer Vision

Abstract

Occlusion is probably the biggest challenge for human pose estimation in the wild. Typical solutions often rely on intrusive sensors such as IMUs to detect occluded joints. To make the task truly unconstrained, we present AdaFuse, an adaptive multiview fusion method which can enhance the features in occluded views by leveraging those in visible views. The core of AdaFuse is to determine the point-point correspondence between two views, which we solve effectively by exploiting the sparsity of the heatmap representation. We also learn an adaptive fusion weight for each camera view to reflect its feature quality, in order to reduce the chance that good features are undesirably corrupted by "bad" views. The fusion model is trained end-to-end with the pose estimation network, and can be directly applied to new camera configurations without additional adaptation. We extensively evaluate the approach on three public datasets including Human3.6M, Total Capture and CMU Panoptic. It outperforms the state of the art on all of them. We also create a large-scale synthetic dataset, Occlusion-Person, which allows us to perform numerical evaluation on the occluded joints, as it provides occlusion labels for every joint in the images. The dataset and code are released at https://github.com/zhezh/adafuse-3d-human-pose.
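As a rough illustration of the fusion idea described above, the following minimal NumPy sketch blends per-view joint heatmaps using per-view quality weights, so a view with an occluded joint borrows evidence from confident views. This is only a sketch under simplifying assumptions: the actual AdaFuse warps heatmaps across views via epipolar geometry and learns the weights end-to-end with the pose network, whereas here the heatmaps are assumed pre-aligned and the function and parameter names are hypothetical.

```python
import numpy as np

def adaptive_fuse(heatmaps, quality_scores):
    """Blend per-view heatmaps of one joint using adaptive view weights.

    heatmaps:       (V, H, W) array, one heatmap per camera view,
                    assumed already warped into a common reference.
    quality_scores: (V,) raw per-view quality scores (higher = better),
                    standing in for the learned fusion weights.

    Returns a (V, H, W) array in which each view's heatmap is blended
    toward a weighted consensus of all views.
    """
    heatmaps = np.asarray(heatmaps, dtype=float)
    # Softmax-normalise the quality scores into fusion weights.
    w = np.exp(quality_scores - np.max(quality_scores))
    w = w / w.sum()
    # Weighted consensus heatmap over all views: (H, W).
    consensus = np.tensordot(w, heatmaps, axes=1)
    # Each view keeps half of its own response and gains half of the
    # consensus, so occluded views are enhanced by visible ones.
    return 0.5 * heatmaps + 0.5 * consensus[None]
```

For example, if view 0 is confident (high score) and the joint is occluded in view 1, the fused heatmap for view 1 acquires a response at view 0's peak location, which a downstream argmax or triangulation step can then exploit.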




Author information


Corresponding author

Correspondence to Wenhu Qin.

Additional information

Communicated by Mei Chen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Work done while Zhe Zhang was an intern at Microsoft Research Asia.


About this article


Cite this article

Zhang, Z., Wang, C., Qiu, W. et al. AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild. Int J Comput Vis 129, 703–718 (2021). https://doi.org/10.1007/s11263-020-01398-9
