Spatial-aware stacked regression network for real-time 3D hand pose estimation
Introduction
Hand pose estimation plays an important role in human–computer interaction applications such as virtual reality and augmented reality. With the development of deep learning and low-cost depth cameras, the field has made significant progress in recent years [1], [2], [3]. Nevertheless, accurate and robust hand pose estimation remains challenging due to large variations in hand orientation, high similarity among fingers, severe self-occlusion and poor-quality depth images. In addition, improving the speed of the algorithm is also important in order to satisfy the requirements of interactive application scenarios.
Recently, deep neural network-based approaches have achieved dramatic performance improvements in 3D hand pose estimation. One line of work is holistic regression [4], [5], [6], [7], [8], [9], which aims to directly predict 3D pose parameters, such as joint angles or 3D joint locations, from the depth image. Regression-based methods are able to capture global constraints among different joints, so they are robust to self-occlusion and poor-quality images [2]. However, since these methods treat the depth image as a 2D image, they under-utilize its 3D spatial information and thus have relatively low precision.
Explicit consideration of the 3D spatial properties of the depth image can significantly improve estimation accuracy [2]. One straightforward solution is to convert the depth image into 3D data, such as voxels [10], [11] or points [12], [13], [14], [15], [16], and then apply a 3D deep learning method. An alternative is to incorporate spatial-aware representations into 2D CNNs. These works utilize fully convolutional networks to estimate pixel-wise representations, such as 3D heat-maps, 3D unit vector fields and approximated geodesic distance maps, from which the 3D hand joint locations can be inferred by post-processing [17], [18], [19] or a regression network [20]. Adopting spatial-aware representations allows 2D CNNs to consider both the 2D and 3D properties of the depth image and makes it easy to leverage a stacked architecture to reevaluate initial estimates. However, these methods still suffer from some limitations.
First, performing pixel-wise estimation is inefficient, because it needs a computationally expensive upsampling step to obtain high-resolution pixel-wise estimations [17], [19], [21]. Thus, compared with regression-based methods, these methods require more complex network structures and more computation. Second, when the depth data near the target joint points are missing or occluded, directly inferring the joint coordinates from the spatial-aware representations is unreliable [13].
In essence, spatial-aware representations are embeddings of low-dimensional joint coordinates in a high-dimensional image space. They reflect the spatial location of joints, or spatial relationships such as the distance, direction and offset between the joint coordinates and each pixel of the depth image. They provide rich 3D spatial information and powerful disambiguation cues for further optimization. We argue that the form of the representation is more critical than the process of representation generation for iterative refinement. Based on this insight, we propose a stacked regression architecture that directly encodes the previously estimated pose into spatial-aware representations through pose re-parameterization, so that subsequent regression stages can exploit multi-joint spatial context and the 3D spatial relationship between the estimated pose and the depth data to perform iterative refinement.
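The re-parameterization described above can be sketched concretely: given camera intrinsics, each depth pixel is back-projected to a 3D point, and per-pixel distance heat-maps and unit offset vectors toward each estimated joint are computed directly from the pose, with no learned parameters. Below is a minimal NumPy sketch under an assumed pinhole camera model; the function name, the Gaussian heat-map form and the `sigma` value are illustrative, not the paper's exact formulation.

```python
import numpy as np

def reparameterize_pose(depth, joints, fx, fy, cx, cy, sigma=0.4):
    """Encode an estimated pose (J x 3, camera-space, same units as depth)
    into per-pixel spatial-aware maps: a distance heat-map per joint and
    3D unit offset vectors from every depth pixel toward every joint."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project each depth pixel to a 3D point (pinhole camera model).
    x = (us - cx) * depth / fx
    y = (vs - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1)            # (H, W, 3)
    # Offset from every 3D point to every joint.
    offsets = joints[:, None, None, :] - points[None]    # (J, H, W, 3)
    dist = np.linalg.norm(offsets, axis=-1)              # (J, H, W)
    heatmaps = np.exp(-dist ** 2 / (2 * sigma ** 2))     # peak at the joint
    unit_offsets = offsets / np.maximum(dist[..., None], 1e-8)
    return heatmaps, unit_offsets
```

Because the whole computation is a few dense array operations, it is cheap and differentiable, which matches the paper's claim that re-parameterization adds little overhead compared with pixel-wise estimation networks.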
In this paper, we propose a Spatial-aware Stacked Regression Network (SSRN) for fast, robust and accurate 3D hand pose estimation from a single depth image. SSRN has multiple pose regression modules, which are connected by the differentiable pose re-parameterization module. Specifically, the pose re-parameterization module generates spatial-aware representations from previously estimated pose directly. Then, the subsequent pose regression module predicts a more accurate pose based on multi-joint spatial context and the 3D spatial relationship between the estimated pose and the depth data on the spatial-aware representations. We regard the first pose regression module as the initial stage and the subsequent regression module as the refinement stage. The pose re-parameterization process is simple, fast and non-parametric. It can generate high-quality spatial-aware representations with little overhead in computation and storage. Furthermore, we integrate and explore multiple good practices including data augmentation, smooth L1 loss, localization refinement and coordinate decoupling to improve the performance of 2D CNN for 3D hand pose estimation.
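Among the good practices listed above, the smooth L1 loss has a standard closed form: quadratic for small residuals and linear for large ones, which damps the influence of outlier joints on the gradient. A minimal NumPy sketch, where the `beta` threshold is an assumed hyperparameter:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber-style) loss over joint coordinates:
    0.5 * d^2 / beta for |d| < beta, |d| - 0.5 * beta otherwise."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()
```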
Our main contributions can be summarized as follows:
(1) We adopt a differentiable pose re-parameterization process to generate spatial-aware representations. Compared with performing pixel-wise estimation, it can generate high-quality spatial-aware representations with little overhead in computation and storage.
(2) We incorporate the spatial-aware representations into a stacked regression network. Spatial-aware representations allow us to efficiently utilize the 3D spatial information in the depth image and multi-joint spatial context for accurate 3D hand pose estimation.
This paper is an extension of our conference paper [22]. The new contributions of this paper are summarized as follows:
(1) We propose a spatial attention mechanism to replace the global average pooling (GAP) and fully connected (FC) layer in the regression-based method to reduce the influence of irrelevant regions in the feature maps for pose estimation. Experimental results show that, with the spatial attention mechanism, the estimation accuracy of the regression-based method can be further improved.
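The attention-based replacement for GAP can be illustrated as follows: a learned scoring function rates each spatial location, a softmax over the spatial dimensions turns the scores into weights, and the feature map is pooled by a weighted sum instead of a uniform average, down-weighting irrelevant regions. Below is a minimal NumPy sketch with hypothetical names, using a single weight vector as a stand-in for a learned 1×1 convolution; the paper's actual attention design may differ.

```python
import numpy as np

def softmax2d(scores):
    """Softmax over all spatial positions of a 2D score map."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def spatial_attention_pool(features, w_att):
    """Attention-weighted pooling of a (C, H, W) feature map.
    w_att: (C,) weights scoring each location's relevance
    (a stand-in for a learned 1x1 convolution)."""
    c = features.shape[0]
    scores = np.tensordot(w_att, features, axes=1)   # (H, W) logits
    attn = softmax2d(scores)                         # weights sum to 1
    pooled = (features * attn[None]).reshape(c, -1).sum(axis=1)  # (C,)
    return pooled, attn
```

With uniform attention weights this reduces exactly to global average pooling, so the mechanism strictly generalizes the GAP + FC baseline it replaces.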
(2) We propose a cross-stage self-distillation mechanism to align the features of the initial stage with those of the refinement stage, which allows us to adopt a more lightweight network in the initial stage while maintaining estimation accuracy, so that the whole network has a faster inference speed.
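One common way to realize such cross-stage feature alignment is an L2 penalty between the lightweight initial-stage (student) features and the refinement-stage (teacher) features, with the teacher treated as a fixed target so that gradients only update the student. This is a generic sketch of that idea in NumPy; the paper's exact alignment loss may differ.

```python
import numpy as np

def self_distill_loss(student_feat, teacher_feat):
    """Mean-squared alignment loss between initial-stage (student)
    and refinement-stage (teacher) features. In a deep learning
    framework the teacher would be detached (stop-gradient)."""
    teacher = teacher_feat.copy()  # stand-in for detach/stop-gradient
    return np.mean((student_feat - teacher) ** 2)
```

During training this term would be added to the pose loss with a weighting factor, so the cheap initial stage learns to mimic the richer refinement-stage features.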
(3) We conduct more extensive self-comparison experiments and a cross-dataset experiment. Experimental results show that our method has good generalization ability.
We evaluate our method on four publicly available 3D hand pose datasets (NYU [1], ICVL [23], MSRA [24], HANDS 2017 [2]). Our method achieves state-of-the-art accuracy on all four datasets with fewer parameters and a faster frame rate. The inference speed of our method is around 330 FPS on a single GPU and 35 FPS on a single CPU.
Section snippets
Depth-based 3D hand pose estimation
The methods of estimating 3D hand pose from a single depth image can be categorized into three classes: generative methods, discriminative methods and hybrid methods. Generative methods [25], [26], [27], [28], [29], [30], [31], [32], [33] use a pre-defined 3D hand model to fit the depth input. Their effectiveness heavily relies on the construction of the hand model and the definition of the energy function. These approaches need a time-consuming optimization procedure and are likely to get trapped in local minima
Overview
In order to better capture the spatial structure of the depth data, our method aims at incorporating spatial-aware representations into the network inference. However, we also want to avoid performing pixel-wise estimation due to the computational overhead and lack of robustness. To that end, we adopt a pose re-parameterization process to encode an estimated pose into spatial-aware representations directly. Fig. 1 illustrates the architecture of our method. SSRN contains a feature extraction
Dataset and evaluation metric
We conduct experiments on four publicly available datasets: NYU dataset [1], ICVL dataset [23], MSRA dataset [24] and HANDS 2017 dataset [2].
NYU dataset [1] consists of 72K training and 8.2K testing depth images captured by the PrimeSense 3D sensor. The hand pose annotation contains 36 joints. Following previous works [6], [11], we select a subset of 14 joints for training and testing.
ICVL dataset [23] consists of 330 K training and 1.6 K testing depth images captured by
Conclusion
In this paper, we propose a Spatial-aware Stacked Regression Network (SSRN) for fast, robust and accurate 3D hand pose estimation from a single depth image. We utilize a differentiable pose re-parameterization module to efficiently generate a high-quality spatial-aware representation from the previously estimated pose. Taking such representations as input, a pose regression module allows the SSRN to utilize the 3D spatial structure of depth data and multi-joint spatial context to reevaluate the
CRediT authorship contribution statement
Pengfei Ren: Conceptualization, Methodology, Software, Writing - original draft, Investigation, Validation, Data curation, Formal analysis, Writing - review & editing. Haifeng Sun: Resources, Supervision, Project administration, Funding acquisition. Weiting Huang: Investigation, Validation, Writing - review & editing. Jiachang Hao: Resources, Formal analysis, Writing - review & editing. Daixuan Cheng: Writing - review & editing. Qi Qi: Resources, Supervision, Project administration, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under Grants 61671079, 61771068, and in part by the Beijing Municipal Natural Science Foundation under Grant 4182041. This work was also supported by BUPT Excellent Ph.D. Students Foundation CX2020121.
References (80)
- A survey on 3D hand pose estimation: Cameras, methods, and datasets, Pattern Recogn. (2019)
- J. Tompson, M. Stein, Y. Lecun, K. Perlin, Real-time continuous pose recovery of human hands using convolutional...
- Depth-based 3d hand pose estimation: From current achievements to future goals
- M. Oberweger, P. Wohlhart, V. Lepetit, Hands deep in deep learning for hand pose estimation, in: Proceedings of the...
- Region ensemble network: Improving convolutional network for hand pose estimation
- X. Chen, G. Wang, H. Guo, C. Zhang, Pose guided structured region ensemble network for cascaded hand pose estimation,...
- M. Oberweger, V. Lepetit, Deepprior++: Improving fast and accurate 3d hand pose estimation, in: Proceedings of the IEEE...
- M. Madadi, S. Escalera, X. Baró, J. Gonzalez, End-to-end global to local cnn learning for hand pose recovery in depth...
- M. Oberweger, P. Wohlhart, V. Lepetit, Training a Feedback Loop for Hand Pose Estimation, in: Proceedings of the IEEE...
- L. Ge, H. Liang, J. Yuan, D. Thalmann, 3D Convolutional Neural Networks for Efficient and Robust Hand Pose Estimation...
- V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation From a Single Depth Map
- Point-to-pose voting based hand pose estimation using residual permutation equivariant layer
- So-handnet: Self-organizing network for 3d hand pose estimation with semi-supervised learning
- Shpr-net: Deep semantic hand pose regression from point clouds, IEEE Access
- Dense 3d regression for hand pose estimation
- A2j: Anchor-to-joint regression network for 3d articulated pose estimation from a single depth image
- AWR: Adaptive Weighting Regression for 3D Hand Pose Estimation
- Handmap: Robust hand pose estimation via intermediate dense guidance map supervision
- Simple baselines for human pose estimation and tracking
- Latent regression forest: Structured estimation of 3d articulated hand posture
- Cascaded hand pose regression
- Realtime and robust hand tracking from depth
- Learning an efficient model of hand shape variation from depth images
- Fast and robust hand tracking using detection-guided optimization
- Robust articulated-ICP for real-time hand tracking, Computer Graphics Forum
- Motion capture of hands in action using discriminative salient points
- Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera
- Fast pose estimation with parameter-sensitive hashing
- Latent regression forest: structured estimation of 3d hand poses, IEEE Trans. Pattern Anal. Mach. Intell.
- Accurate, robust, and flexible real-time hand tracking
- Real-time joint tracking of a hand manipulating an object from rgb-d input
- Generative Model-Based Loss to the Rescue: A Method to Overcome Annotation Errors for Depth-Based Hand Pose Estimation
Pengfei Ren obtained his B.S. degree from Beijing University of Posts and Telecommunications in 2018, where he is currently working toward his M.S. degree. His research interests include deep learning, hand pose estimation and gesture recognition.
Haifeng Sun obtained his Ph.D. degree from Beijing University of Posts and Telecommunications in 2017. He is currently a lecturer at Beijing University of Posts and Telecommunications. His research interests include data mining, information retrieval, and Next Generation Networks.
Weiting Huang is a graduate student at Beijing University of Posts and Telecommunications. Her research interest is computer vision.
Jiachang Hao obtained his Bachelor's degree from Beijing University of Posts and Telecommunications in 2018. He is currently a postgraduate student at Beijing University of Posts and Telecommunications. His research interests include video understanding, action recognition and object detection.
Daixuan Cheng is a bachelor student under the supervision of Prof. Jingyu Wang in the Network Intelligence Research Center at Beijing University of Posts and Telecommunications. She has worked on various projects with Prof. Haifeng Sun. She is interested in natural language processing, computer vision, machine learning and data mining.
Qi Qi received her Ph.D. degree from Beijing University of Posts and Telecommunications in 2010, where she is currently an Associate Professor with the State Key Laboratory of Networking and Switching Technology. She has published more than 30 papers in international journals and has received two National Natural Science Foundation of China grants. Her research interests include ubiquitous services, deep learning, transfer learning, deep reinforcement learning, edge computing, and the Internet of Things.
Jingyu Wang was born in 1978 and obtained his Ph.D. degree from Beijing University of Posts and Telecommunications in 2008. He is currently a full professor with the State Key Laboratory of Networking and Switching Technology at Beijing University of Posts and Telecommunications, China. His research interests span broad aspects of the future internet, intelligent networks, machine learning and data mining. He has published hundreds of research papers and several books, and has been granted dozens of patents for inventions.
Jianxin Liao obtained his Ph.D. degree from the University of Electronic Science and Technology of China in 1996. He is currently the dean of the Network Intelligence Research Center and a full professor with the State Key Laboratory of Networking and Switching Technology at Beijing University of Posts and Telecommunications. He has published hundreds of research papers and several books, and has been granted dozens of patents for inventions. He has won a number of prizes, including the Premier's Award of Distinguished Young Scientists from the National Natural Science Foundation of China in 2005, and appointment as a Specially-invited Professor under the "Yangtse River Scholar Award Program" by the China Ministry of Education in 2009. His main creative contributions include mobile intelligent networks, service network intelligence, networking architectures and protocols, and multimedia communication. These achievements were conferred the "National Prize for Progress in Science and Technology" in 2004 and 2009.