Special Section on SICNCG 2020
GMDN: A lightweight graph-based mixture density network for 3D human pose regression
Introduction
3D human pose estimation is an active research field in computer vision, and it has numerous applications such as video surveillance, human-computer interaction and sports analysis. Due to the success of deep neural networks and large-scale motion capture datasets, tremendous progress has been made in 3D human pose estimation in recent years.
Current methods usually either estimate 3D joint locations directly from a single RGB image or estimate them from intermediate 2D predictions. We follow the second approach because large-scale in-the-wild 2D pose datasets are available, and thus 2D pose estimation networks such as [1], [2] can provide high-quality 2D predictions, which benefit the 3D pose estimation stage.
Recently, numerous works [3], [4], [5], [6], [7] have focused on the 2D-to-3D lifting problem. In general, most of them regress the 3D locations of the joints by minimizing the mean squared error between a single predicted pose and the ground truth. However, because the back-projection from 2D to 3D is under-constrained, recovering 3D information from 2D joint locations remains a challenging problem. More specifically, it is an inverse problem in which multiple valid 3D solutions may correspond to a single 2D pose, making it difficult to infer a unique valid solution, especially in cases with severe occlusions and pose variations. Moreover, existing deep learning based pose estimation models usually consume substantial computing resources due to their complicated network structures, which makes them unsuitable for deployment on devices with limited computing resources such as smartphones or embedded systems.
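The ambiguity described above can be illustrated with a small worked example. The sketch below is not taken from the paper: it assumes an orthographic camera (depth is simply dropped) and uses hypothetical elbow and wrist coordinates. It shows two forearm poses with the same bone length that produce identical 2D joint locations, so a network regressing a single pose cannot tell them apart from the 2D input alone.

```python
import math

def orthographic(point):
    """Project a 3D point (x, y, z) to 2D by dropping depth (assumed camera model)."""
    x, y, _ = point
    return (x, y)

def bone_length(a, b):
    """Euclidean distance between two 3D joints."""
    return math.dist(a, b)

# Hypothetical joint positions in metres, camera looking along the z axis.
elbow = (0.0, 0.0, 3.0)
wrist_towards_camera = (0.2, 0.1, 3.0 - 0.15)
wrist_away_from_camera = (0.2, 0.1, 3.0 + 0.15)

# Both candidate wrists project to the same 2D location ...
same_2d = orthographic(wrist_towards_camera) == orthographic(wrist_away_from_camera)

# ... and both keep the same forearm length, so both are anatomically plausible.
same_length = math.isclose(
    bone_length(elbow, wrist_towards_camera),
    bone_length(elbow, wrist_away_from_camera),
)

print(same_2d, same_length)
```

Under a perspective camera the two projections would differ slightly, but the same kind of depth ambiguity remains.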
In this paper, we introduce a lightweight graph-based mixture density network (GMDN) to resolve the above problems. The 2D joint locations are formulated as a graph (as shown in Fig. 1), and a mixture density network (MDN) is utilized to recover the human body structures by learning the structural joint relations that exist in the human skeleton. In contrast to previous graph-based methods, we incorporate structural knowledge about human body configurations into the graph to enhance reasoning about the human structures. Additionally, the 3D human poses are regressed as a Gaussian mixture distribution, which generates multiple valid solutions for each body joint. We train the network with an improved MDN loss function [9] to avoid collapse of the model into a unimodal Gaussian distribution. The effectiveness of our approach is validated by comprehensive experiments on the Human3.6M [8] dataset, with comparisons to existing state-of-the-art methods. As the results demonstrate, our approach achieves state-of-the-art performance with only 0.30M parameters, which is 89.7% fewer than previous work [10]. We also provide an ablation study of our network to test the contribution of each component. Furthermore, we show visualization results of our method on both the Human3.6M [8] dataset and in-the-wild datasets [11], [12], which qualitatively demonstrate the effectiveness of our approach.
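To make the mixture density idea concrete, the following is a minimal sketch of the standard MDN negative log-likelihood for an isotropic Gaussian mixture. It is an illustration only: the improved loss of [9] used in this paper is not reproduced here, and the function name `mdn_nll`, the isotropic covariance, and the toy values are assumptions.

```python
import math

def mdn_nll(target, weights, means, sigmas):
    """Negative log-likelihood of `target` under an isotropic Gaussian mixture.

    target : list of d floats (e.g. the 3D coordinates of one joint)
    weights: list of K strictly positive mixing coefficients that sum to 1
    means  : list of K lists of d floats (one hypothesis per component)
    sigmas : list of K positive standard deviations (one per component)
    """
    d = len(target)
    log_terms = []
    for w, mu, s in zip(weights, means, sigmas):
        sq_dist = sum((t - m) ** 2 for t, m in zip(target, mu))
        log_gauss = (-0.5 * sq_dist / s ** 2
                     - d * math.log(s)
                     - 0.5 * d * math.log(2.0 * math.pi))
        log_terms.append(math.log(w) + log_gauss)
    # log-sum-exp over components for numerical stability
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))

# Two hypotheses for one joint that differ only in depth (toy values).
weights = [0.5, 0.5]
means = [[0.0, 0.0, -1.0], [0.0, 0.0, 1.0]]
sigmas = [0.1, 0.1]

loss_at_mode = mdn_nll([0.0, 0.0, 1.0], weights, means, sigmas)
loss_at_midpoint = mdn_nll([0.0, 0.0, 0.0], weights, means, sigmas)
print(loss_at_mode < loss_at_midpoint)
```

A ground truth lying on either hypothesis is cheap under this loss, whereas the midpoint between them, which is what a single-pose squared-error regressor tends towards, is heavily penalized.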
Our main contributions are as follows:
- We formulate the 2D joint locations of the human body as a graph and present a novel lightweight graph convolutional operation with structural knowledge about human bodies.
- Based on the proposed graph convolutional operation, a novel graph-based mixture density network (GMDN) is proposed to resolve the ambiguity and occlusion problems existing in 3D human pose estimation.
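For readers unfamiliar with graph convolutions on skeletons, the sketch below shows a generic layer of the kind the first contribution builds on: each joint averages the features of itself and its neighbors, then a shared linear map is applied. This is a plain GCN-style layer under assumed row normalization, not the operation proposed in this paper; the 4-joint chain and all function names are hypothetical.

```python
def normalized_adjacency(num_joints, bones):
    """Row-normalized adjacency matrix with self-loops for a skeleton graph."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_joints)]
           for i in range(num_joints)]
    for i, j in bones:
        adj[i][j] = adj[j][i] = 1.0  # bones are undirected edges
    return [[value / sum(row) for value in row] for row in adj]

def matmul(x, y):
    """Dense matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]

def graph_conv(features, adj, weight):
    """Aggregate neighbor features with `adj`, then apply the shared `weight`."""
    return matmul(matmul(adj, features), weight)

# Toy chain of 4 joints (hypothetical): hip - spine - neck - head.
bones = [(0, 1), (1, 2), (2, 3)]
adj = normalized_adjacency(4, bones)

# One (x, y) feature vector per joint; an identity weight isolates the
# effect of neighborhood aggregation.
joints_2d = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [0.0, 3.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = graph_conv(joints_2d, adj, identity)
print(out)
```

Because the weight matrix is shared across joints, the parameter count of such a layer depends on the feature dimensions rather than on the number of joints, which is one reason graph-based models can stay small.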
The remainder of this paper is organized as follows. Section 2 briefly reviews related work on 3D human pose estimation. Section 3 elaborates the overall framework of our network. The experimental results of our network, with comparisons to existing methods, are reported in Section 4. Finally, the conclusions and future directions are provided in Section 5.
Section snippets
3D human pose estimation
3D human pose estimation methods can be broadly divided into two categories: end-to-end methods and two-stage methods. End-to-end methods directly regress 3D poses from RGB images. Mehta et al. [13] first employed fully convolutional networks for 3D human pose estimation. Later on, Pavlakos et al. [14] presented a volumetric representation to estimate the 3D poses in a coarse-to-fine manner. To predict 3D poses from images in the wild, a geometric loss was proposed in [15] to allow weakly supervised
Method
Our overall framework for estimating 3D human poses from 2D pose detections is shown in Fig. 2. As the figure shows, the input of GMDN is an array of 2D joint locations with a predefined adjacency matrix, while the output of GMDN is the set of parameters of the mixture density network. GMDN consists of a Feature Extractor and a Hypotheses Generator. First, the Feature Extractor extracts point-wise feature embeddings for each body joint on the graph; then, the Hypotheses Generator processes the
Experiments
We first describe the implementation details of GMDN (Section 4.1). Then, the datasets and protocols used for training and evaluating GMDN are introduced (Section 4.2). Next, we show our results on the Human3.6M [8] dataset and compare them with those of state-of-the-art methods (Section 4.3). The generalization results of GMDN on in-the-wild images are also reported. Finally, ablation studies are conducted to validate the effectiveness of the different component contributing
Conclusion
In this paper, we present a graph-based mixture density network to resolve the problem of 3D human pose estimation. For reasoning about the structural information encoded in holistic human bodies, a graph convolutional operation incorporating structural knowledge about human body configurations is proposed. Additionally, each body joint is modeled as a multimodal Gaussian distribution that generates multiple reasonable solutions. Experimental results demonstrate that the presented GMDN is
CRediT authorship contribution statement
Lu Zou: Conceptualization, Methodology, Software, Writing - original draft. Zhangjin Huang: Writing - original draft, Supervision. Naijie Gu: Investigation, Supervision. Fangjun Wang: Data curation, Visualization. Zhouwang Yang: Validation, Writing - review & editing. Guoping Wang: Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (Nos. 71991464 / 71991460, and 61877056), and the Fundamental Research Funds for the Central Universities (Nos. WK6030000109 and WK5290000001).
References (30)
- et al. Cascaded pyramid network for multi-person pose estimation. Proceedings of the IEEE conference on computer vision and pattern recognition (2018)
- et al. Stacked hourglass networks for human pose estimation. Proceedings of the European conference on computer vision (2016)
- et al. A simple yet effective baseline for 3D human pose estimation. Proceedings of the IEEE international conference on computer vision (2017)
- et al. Propagating LSTM: 3D pose estimation based on joint interdependency. Proceedings of the European conference on computer vision (ECCV) (2018)
- et al. Learning pose grammar to encode human body configuration for 3D pose estimation. Proceedings of the thirty-second AAAI conference on artificial intelligence (2018)
- et al. Ordinal depth supervision for 3D human pose estimation. Proceedings of the IEEE conference on computer vision and pattern recognition (2018)
- et al. 3D human pose estimation in the wild by adversarial learning. Proceedings of the IEEE conference on computer vision and pattern recognition (2018)
- et al. Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. (2013)
- et al. Generating multiple hypotheses for 3D human pose estimation with mixture density network. Proceedings of the IEEE conference on computer vision and pattern recognition (2019)
- et al. Optimizing network structure for 3D human pose estimation. Proceedings of the IEEE international conference on computer vision (2019)