Computers & Graphics

Volume 95, April 2021, Pages 115-122

Special Section on SICNCG 2020
GMDN: A lightweight graph-based mixture density network for 3D human pose regression

https://doi.org/10.1016/j.cag.2021.01.010

Highlights

  • We formulate the 2D joint locations of the human body as a graph and present a novel lightweight graph convolutional operation with structural knowledge about human bodies.

  • Based on this graph convolutional operation, a novel graph-based mixture density network (GMDN) is proposed to resolve the ambiguity and occlusion problems in 3D human pose estimation.

  • Comprehensive experiments on the Human3.6M dataset demonstrate that GMDN achieves state-of-the-art performance with only 0.30M parameters.

Abstract

3D human pose estimation from 2D detections is an ill-posed problem: multiple valid solutions may exist owing to inherent ambiguity and occlusion. In this paper, we propose a novel graph-based mixture density network (GMDN) to tackle the 2D-to-3D human pose estimation problem. We formulate the 2D joint locations of the human body as a graph, so that the pose estimation task can be redefined as a graph regression problem. Additionally, we present a novel graph convolutional operation that incorporates structural knowledge about human body configurations to assist reasoning about the structural relations implicit in the human body. Furthermore, we employ mixture density networks to model the 3D human poses as a multimodal distribution. The presented GMDN is lightweight, with only 0.30M parameters, and experimental results demonstrate that it achieves state-of-the-art performance.

Introduction

3D human pose estimation is an active research field in computer vision with numerous applications, such as video surveillance, human-computer interaction, and sports analysis. Owing to the success of deep neural networks and the availability of large-scale motion capture datasets, tremendous progress has been made in 3D human pose estimation in recent years.

Current methods usually focus either on directly estimating 3D joint locations from a single RGB image or on estimating them from intermediate 2D predictions. We follow the second approach because large-scale in-the-wild 2D pose datasets are available, so 2D pose estimation networks such as [1], [2] can provide high-quality 2D predictions that benefit the 3D pose estimation stage.
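
As a rough sketch of this two-stage pipeline in Python (the function names below are hypothetical placeholders for a pretrained 2D detector such as [1], [2] and a 2D-to-3D lifting model, not the actual interfaces used in this work), the lifting stage consumes only 2D keypoints and can therefore be paired with any off-the-shelf detector:

    def estimate_3d_pose(image, detect_2d, lift_2d_to_3d):
        # detect_2d: pretrained 2D pose network, returns a (J, 2) array of pixel coordinates
        # lift_2d_to_3d: 2D-to-3D lifting model, returns a (J, 3) array of root-relative 3D coordinates
        joints_2d = detect_2d(image)          # stage 1: 2D joint detection
        joints_3d = lift_2d_to_3d(joints_2d)  # stage 2: lifting the 2D joints to 3D
        return joints_3d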

Recently, numerous works [3], [4], [5], [6], [7] have focused on the 2D-to-3D lifting problem. In general, most of them regress the 3D locations of the joints by minimizing the mean squared error between the predicted single pose and the ground truth. However, because the back-projection from 2D to 3D is weakly constrained, recovering 3D information from 2D joint locations remains challenging. More specifically, it is an inverse problem in which multiple valid 3D solutions may correspond to a single 2D pose, making it difficult to infer a unique valid solution, especially in cases with severe occlusion and large pose variation. Moreover, existing deep-learning-based pose estimation models usually require substantial computational resources due to their complicated network structures, which makes them unsuitable for deployment on devices with limited computing power such as smartphones or embedded systems.
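
For concreteness, the single-estimate objective that such lifting methods typically minimize can be written as (notation ours, for illustration only)

    \mathcal{L}_{\mathrm{MSE}} = \frac{1}{J} \sum_{j=1}^{J} \left\lVert f_{\theta}(\mathbf{x})_{j} - \mathbf{y}_{j} \right\rVert_{2}^{2},

where \mathbf{x} is the input 2D pose, \mathbf{y}_{j} is the ground-truth 3D location of joint j, f_{\theta} is the lifting network, and J is the number of body joints. A single deterministic estimate of this kind cannot represent several equally plausible 3D poses for the same 2D input, which motivates the multimodal formulation adopted below.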

In this paper, we introduce a lightweight graph-based mixture density network (GMDN) to resolve the above problems. The 2D joint locations are formulated as a graph (as shown in Fig. 1), and a mixture density network (MDN) is utilized to recover the human body structure by learning the structural relations among joints in the human skeleton. In contrast to previous graph-based methods, we incorporate structural knowledge about human body configurations into the graph to enhance reasoning about human structure. Additionally, the 3D human pose is regressed as a Gaussian mixture distribution that generates multiple valid solutions for each body joint. We train the network with an improved MDN loss function [9] to prevent the model from collapsing into a unimodal Gaussian distribution. The effectiveness of our approach is validated by comprehensive experiments on the Human3.6M dataset [8], with comparisons to existing state-of-the-art methods. As the results demonstrate, our approach achieves state-of-the-art performance with only 0.30M parameters, 89.7% fewer than in previous work [10]. We also provide an ablation study to assess the contribution of each component of our network. Furthermore, we show visualization results of our method on both the Human3.6M dataset [8] and in-the-wild datasets [11], [12], which qualitatively demonstrate the effectiveness of our approach.
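
The exact parameterization is given in Section 3; here we only note that the per-joint multimodal output follows the standard mixture density network form [9] (the notation below is illustrative, and the isotropic covariance is an assumption made for simplicity): the conditional density of the 3D location \mathbf{y}_{j} of joint j given the 2D input \mathbf{x} is modeled as

    p(\mathbf{y}_{j} \mid \mathbf{x}) = \sum_{k=1}^{K} \pi_{j,k}(\mathbf{x}) \, \mathcal{N}\!\left(\mathbf{y}_{j} \mid \boldsymbol{\mu}_{j,k}(\mathbf{x}), \; \sigma_{j,k}^{2}(\mathbf{x}) \, \mathbf{I}_{3}\right), \qquad \sum_{k=1}^{K} \pi_{j,k}(\mathbf{x}) = 1,

where the mixture weights \pi_{j,k}, means \boldsymbol{\mu}_{j,k}, and variances \sigma_{j,k}^{2} are predicted by the network and K is the number of Gaussian kernels. Training maximizes the likelihood of the ground-truth pose under this mixture instead of regressing a single pose.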

Our main contributions are as follows:

  • We formulate the 2D joint locations of the human body as a graph and present a novel lightweight graph convolutional operation with structural knowledge about human bodies.

  • Based on this graph convolutional operation, a novel graph-based mixture density network (GMDN) is proposed to resolve the ambiguity and occlusion problems in 3D human pose estimation.

  • Comprehensive experiments on the Human3.6M dataset demonstrate that GMDN achieves state-of-the-art performance with only 0.30M parameters.

The remainder of this paper is organized as follows. Section 2 briefly reviews related work on 3D human pose estimation. Section 3 elaborates the overall framework of our network. Section 4 reports the experimental results of our network with comparisons to existing methods. Finally, conclusions and future directions are provided in Section 5.

Section snippets

3D human pose estimation

3D human pose estimation methods can be broadly divided into two categories: end-to-end methods and two-stage methods. End-to-end methods directly regress 3D poses from RGB images. Mehta et al. [13] first employed fully convolutional networks for 3D human pose estimation. Later on, Pavlakos et al. [14] presented a volumetric representation to estimate 3D poses in a coarse-to-fine manner. To predict 3D poses from images in the wild, a geometric loss was proposed in [15] to allow weakly supervised training.

Method

Our overall framework for estimating 3D human poses from 2D pose detections is shown in Fig. 2. As the figure shows, the input of GMDN is an array of 2D joint locations together with a predefined adjacency matrix, while the output of GMDN is the set of parameters of a mixture density model. GMDN consists of a Feature Extractor and a Hypotheses Generator. First, the Feature Extractor extracts point-wise feature embeddings for each body joint on the graph; then, the Hypotheses Generator processes these embeddings to produce the mixture parameters.
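
A minimal PyTorch-style sketch of this two-module design is shown below. All names, layer sizes, the simple degree-normalized aggregation, and the number of Gaussian kernels are illustrative assumptions; in particular, GraphConv is a generic graph convolution and does not reproduce the paper's structure-aware operation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GraphConv(nn.Module):
        # Generic graph convolution over body joints: each joint's features are
        # mixed with those of its skeleton neighbours via the adjacency matrix.
        def __init__(self, in_dim, out_dim, adjacency):
            super().__init__()
            self.register_buffer("adj", adjacency)  # (J, J) skeleton adjacency with self-loops
            self.linear = nn.Linear(in_dim, out_dim)

        def forward(self, x):  # x: (batch, J, in_dim)
            x = self.linear(x)  # per-joint linear transform
            norm_adj = self.adj / self.adj.sum(dim=1, keepdim=True)  # degree normalisation
            return torch.matmul(norm_adj, x)  # neighbourhood aggregation

    class GMDNSketch(nn.Module):
        # Feature Extractor (stacked graph convolutions) followed by a
        # Hypotheses Generator that outputs per-joint mixture parameters.
        def __init__(self, adjacency, hidden=64, num_kernels=5):
            super().__init__()
            self.extractor = nn.Sequential(
                GraphConv(2, hidden, adjacency), nn.ReLU(),
                GraphConv(hidden, hidden, adjacency), nn.ReLU(),
            )
            self.num_kernels = num_kernels
            self.pi_head = nn.Linear(hidden, num_kernels)      # mixture weights
            self.mu_head = nn.Linear(hidden, num_kernels * 3)  # 3D means per kernel
            self.sigma_head = nn.Linear(hidden, num_kernels)   # standard deviations

        def forward(self, joints_2d):  # joints_2d: (batch, J, 2)
            feats = self.extractor(joints_2d)            # point-wise embeddings, (batch, J, hidden)
            pi = F.softmax(self.pi_head(feats), dim=-1)  # (batch, J, K), sums to 1 over K
            mu = self.mu_head(feats).reshape(*feats.shape[:2], self.num_kernels, 3)
            sigma = F.elu(self.sigma_head(feats)) + 1.0 + 1e-6  # keep standard deviations positive
            return pi, mu, sigma

For a 16-joint skeleton, adjacency would be a 16 x 16 binary matrix built from the parent-child bone connections (plus self-loops); calling the model on a (batch, 16, 2) tensor of 2D detections yields the mixture parameters from which multiple 3D hypotheses can be drawn.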

Experiments

We first describe the implementation details of GMDN (Section 4.1). Then, the datasets and protocols used for training and evaluating GMDN are introduced (Section 4.2). Next, we present our results on the Human3.6M dataset [8] and compare them with those of state-of-the-art methods (Section 4.3); the generalization results of GMDN on in-the-wild images are also reported. Finally, ablation studies are conducted to validate the contribution of each component of GMDN.
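
Although the snippet does not name the evaluation metric, results on Human3.6M are conventionally reported as the mean per joint position error (MPJPE) in millimetres after root alignment; a minimal sketch, under that assumption:

    import numpy as np

    def mpjpe(pred, gt):
        # pred, gt: (num_poses, num_joints, 3) root-relative 3D joint positions in mm
        return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))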

Conclusion

In this paper, we present a graph-based mixture density network to resolve the problem of 3D human pose estimation. To reason about the structural information encoded in holistic human bodies, a graph convolutional operation incorporating structural knowledge about human body configurations is proposed. Additionally, each body joint is modeled as a multimodal Gaussian distribution that generates multiple reasonable solutions. Experimental results demonstrate that the presented GMDN is lightweight, with only 0.30M parameters, and achieves state-of-the-art performance.

CRediT authorship contribution statement

Lu Zou: Conceptualization, Methodology, Software, Writing - original draft. Zhangjin Huang: Writing - original draft, Supervision. Naijie Gu: Investigation, Supervision. Fangjun Wang: Data curation, Visualization. Zhouwang Yang: Validation, Writing - review & editing. Guoping Wang: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Nos. 71991464/71991460 and 61877056), and the Fundamental Research Funds for the Central Universities (Nos. WK6030000109 and WK5290000001).

References (30)

  • Y. Chen et al.

    Cascaded pyramid network for multi-person pose estimation

    Proceedings of the IEEE conference on computer vision and pattern recognition

    (2018)
  • A. Newell et al.

    Stacked hourglass networks for human pose estimation

    Proceedings of the European conference on computer vision

    (2016)
  • J. Martinez et al.

    A simple yet effective baseline for 3D human pose estimation

    Proceedings of the IEEE international conference on computer vision

    (2017)
  • K. Lee et al.

    Propagating LSTM: 3D pose estimation based on joint interdependency

    Proceedings of the European conference on computer vision (ECCV)

    (2018)
  • H.-S. Fang et al.

    Learning pose grammar to encode human body configuration for 3D pose estimation

    Proceedings of the thirty-second AAAI conference on artificial intelligence

    (2018)
  • G. Pavlakos et al.

    Ordinal depth supervision for 3D human pose estimation

    Proceedings of the IEEE conference on computer vision and pattern recognition

    (2018)
  • W. Yang et al.

    3D human pose estimation in the wild by adversarial learning

    Proceedings of the IEEE conference on computer vision and pattern recognition

    (2018)
  • C. Ionescu et al.

    Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • C. Li et al.

    Generating multiple hypotheses for 3D human pose estimation with mixture density network

    Proceedings of the IEEE conference on computer vision and pattern recognition

    (2019)
  • H. Ci et al.

    Optimizing network structure for 3D human pose estimation

    Proceedings of the IEEE international conference on computer vision

    (2019)
  • D. Mehta et al.

    Monocular 3D human pose estimation in the wild using improved CNN supervision

    Proceedings of the international conference on 3D vision (3DV)

    (2017)
  • M. Andriluka et al.

    2D human pose estimation: new benchmark and state of the art analysis

    Proceedings of the IEEE conference on computer vision and pattern recognition

    (2014)
  • D. Mehta et al.

    VNect: real-time 3D human pose estimation with a single RGB camera

    ACM Trans. Graph. (TOG)

    (2017)
  • G. Pavlakos et al.

    Coarse-to-fine volumetric prediction for single-image 3D human pose

    Proceedings of the IEEE conference on computer vision and pattern recognition

    (2017)
  • X. Zhou et al.

    Towards 3D human pose estimation in the wild: a weakly-supervised approach

    Proceedings of the IEEE international conference on computer vision

    (2017)