Spatial-aware stacked regression network for real-time 3D hand pose estimation
Introduction
Hand pose estimation plays an important role in human–computer interaction applications such as virtual reality and augmented reality. With the development of deep learning and low-cost depth cameras, the field has made significant progress in recent years [1], [2], [3]. Nevertheless, accurate and robust hand pose estimation remains challenging due to large variations in hand orientation, high similarity among fingers, severe self-occlusion and poor-quality depth images. In addition, improving the speed of the algorithm is also important in order to satisfy the requirements of interactive application scenarios.
Recently, deep neural network-based approaches have achieved dramatic performance improvements in 3D hand pose estimation. One line of work is holistic regression [4], [5], [6], [7], [8], [9], which aims to directly predict 3D pose parameters, such as joint angles or 3D joint locations, from the depth image. Regression-based methods are able to capture global constraints among different joints, so they are robust to self-occlusion and poor-quality images [2]. However, since these methods treat the depth image as a 2D image, they under-utilize its 3D spatial information and thus have relatively low precision.
Explicit consideration of the 3D spatial properties of the depth image can significantly improve estimation accuracy [2]. One straightforward solution is to convert the depth image into 3D data, such as voxels [10], [11] or points [12], [13], [14], [15], [16], and then apply a 3D deep learning method. An alternative is to incorporate spatial-aware representations into 2D CNNs. These works utilize fully convolutional networks to estimate pixel-wise representations, such as 3D heat-maps, 3D unit vector fields and approximated geodesic distance maps, from which the 3D hand joint locations can be inferred by post-processing [17], [18], [19] or a regression network [20]. Adopting spatial-aware representations allows 2D CNNs to consider both the 2D and 3D properties of the depth image and makes it easy to leverage a stacked architecture to reevaluate initial estimates. However, these methods still suffer from some limitations.
First, performing pixel-wise estimation is inefficient, because it needs a computationally expensive upsampling step to obtain high-resolution pixel-wise estimations [17], [19], [21]. Thus, compared with regression-based methods, these methods require more complex network structures and more computation. Second, when the depth data near the target joint points are missing or occluded, directly inferring the joint coordinates from the spatial-aware representations is unreliable [13].
In essence, spatial-aware representations are embeddings of low-dimensional joint coordinates in a high-dimensional image space. They reflect the spatial location of joints, or spatial relationships such as the distance, direction and offset between the joint coordinates and each pixel of the depth image. They provide rich 3D spatial information and powerful disambiguation cues for further optimization. We argue that the form of the representation is more critical than the process of representation generation for iterative refinement. Based on this insight, we propose a stacked regression architecture that directly encodes the previously estimated pose into spatial-aware representations through pose re-parameterization, so that subsequent regression stages can exploit multi-joint spatial context and the 3D spatial relationship between the estimated pose and the depth data to perform iterative refinement.
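The re-parameterization described above can be sketched concretely: given camera intrinsics, each depth pixel is back-projected to a 3D point, and per-pixel distance heat-maps and unit offset vectors toward each estimated joint are computed directly from the pose, with no learned parameters. Below is a minimal NumPy sketch under an assumed pinhole camera model; the function name, the Gaussian heat-map form and the `sigma` value are illustrative, not the paper's exact formulation.

```python
import numpy as np

def reparameterize_pose(depth, joints, fx, fy, cx, cy, sigma=0.4):
    """Encode an estimated pose (J x 3, camera-space, same units as depth)
    into per-pixel spatial-aware maps: a distance heat-map per joint and
    3D unit offset vectors from every depth pixel toward every joint."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project each depth pixel to a 3D point (pinhole camera model).
    x = (us - cx) * depth / fx
    y = (vs - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1)            # (H, W, 3)
    # Offset from every 3D point to every joint.
    offsets = joints[:, None, None, :] - points[None]    # (J, H, W, 3)
    dist = np.linalg.norm(offsets, axis=-1)              # (J, H, W)
    heatmaps = np.exp(-dist ** 2 / (2 * sigma ** 2))     # peak at the joint
    unit_offsets = offsets / np.maximum(dist[..., None], 1e-8)
    return heatmaps, unit_offsets
```

Because the whole computation is a few dense array operations, it is cheap and differentiable, which matches the paper's claim that re-parameterization adds little overhead compared with pixel-wise estimation networks.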
In this paper, we propose a Spatial-aware Stacked Regression Network (SSRN) for fast, robust and accurate 3D hand pose estimation from a single depth image. SSRN has multiple pose regression modules, which are connected by the differentiable pose re-parameterization module. Specifically, the pose re-parameterization module generates spatial-aware representations from previously estimated pose directly. Then, the subsequent pose regression module predicts a more accurate pose based on multi-joint spatial context and the 3D spatial relationship between the estimated pose and the depth data on the spatial-aware representations. We regard the first pose regression module as the initial stage and the subsequent regression module as the refinement stage. The pose re-parameterization process is simple, fast and non-parametric. It can generate high-quality spatial-aware representations with little overhead in computation and storage. Furthermore, we integrate and explore multiple good practices including data augmentation, smooth L1 loss, localization refinement and coordinate decoupling to improve the performance of 2D CNN for 3D hand pose estimation.
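Among the good practices listed above, the smooth L1 loss has a standard closed form: quadratic for small residuals and linear for large ones, which damps the influence of outlier joints on the gradient. A minimal NumPy sketch, where the `beta` threshold is an assumed hyperparameter:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber-style) loss over joint coordinates:
    0.5 * d^2 / beta for |d| < beta, |d| - 0.5 * beta otherwise."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()
```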
Our main contributions can be summarized as follows:
(1) We adopt a differentiable pose re-parameterization process to generate spatial-aware representations. Compared with performing pixel-wise estimation, it can generate high-quality spatial-aware representations with little overhead in computation and storage.
(2) We incorporate the spatial-aware representations into a stacked regression network. Spatial-aware representations allow us to efficiently utilize the 3D spatial information in the depth image and multi-joint spatial context for accurate 3D hand pose estimation.
This paper is an extension of our conference paper [22]. The new contributions of this paper are summarized as follows:
(1) We propose a spatial attention mechanism to replace the global average pooling (GAP) and fully connected (FC) layer in the regression-based method to reduce the influence of irrelevant regions in the feature maps for pose estimation. Experimental results show that, with the spatial attention mechanism, the estimation accuracy of the regression-based method can be further improved.
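The attention-based replacement for GAP can be illustrated as follows: a learned scoring function rates each spatial location, a softmax over the spatial dimensions turns the scores into weights, and the feature map is pooled by a weighted sum instead of a uniform average, down-weighting irrelevant regions. Below is a minimal NumPy sketch with hypothetical names, using a single weight vector as a stand-in for a learned 1×1 convolution; the paper's actual attention design may differ.

```python
import numpy as np

def softmax2d(scores):
    """Softmax over all spatial positions of a 2D score map."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def spatial_attention_pool(features, w_att):
    """Attention-weighted pooling of a (C, H, W) feature map.
    w_att: (C,) weights scoring each location's relevance
    (a stand-in for a learned 1x1 convolution)."""
    c = features.shape[0]
    scores = np.tensordot(w_att, features, axes=1)   # (H, W) logits
    attn = softmax2d(scores)                         # weights sum to 1
    pooled = (features * attn[None]).reshape(c, -1).sum(axis=1)  # (C,)
    return pooled, attn
```

With uniform attention weights this reduces exactly to global average pooling, so the mechanism strictly generalizes the GAP + FC baseline it replaces.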
(2) We propose a cross-stage self-distillation mechanism to align the features of the initial stage with those of the refinement stage, which allows us to adopt a more lightweight network in the initial stage while maintaining estimation accuracy, so that the whole network has a faster inference speed.
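One common way to realize such cross-stage feature alignment is an L2 penalty between the lightweight initial-stage (student) features and the refinement-stage (teacher) features, with the teacher treated as a fixed target so that gradients only update the student. This is a generic sketch of that idea in NumPy; the paper's exact alignment loss may differ.

```python
import numpy as np

def self_distill_loss(student_feat, teacher_feat):
    """Mean-squared alignment loss between initial-stage (student)
    and refinement-stage (teacher) features. In a deep learning
    framework the teacher would be detached (stop-gradient)."""
    teacher = teacher_feat.copy()  # stand-in for detach/stop-gradient
    return np.mean((student_feat - teacher) ** 2)
```

During training this term would be added to the pose loss with a weighting factor, so the cheap initial stage learns to mimic the richer refinement-stage features.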
(3) We conduct more extensive self-comparison experiments and a cross-dataset experiment. Experimental results show that our method has good generalization ability.
We evaluate our method on four publicly available 3D hand pose datasets (NYU [1], ICVL [23], MSRA [24], HANDS 2017 [2]). Our method achieves state-of-the-art accuracy on all four datasets with fewer parameters and a faster frame rate. The inference speed of our method is around 330 FPS on a single GPU and 35 FPS on a single CPU.
Section snippets
Depth-based 3D hand pose estimation
The methods of estimating 3D hand pose from a single depth image can be categorized into three classes: generative methods, discriminative methods and hybrid methods. Generative methods [25], [26], [27], [28], [29], [30], [31], [32], [33] use a pre-defined 3D hand model to fit the depth input. Their effectiveness heavily relies on the construction of the hand model and the definition of the energy function. These approaches need a time-consuming optimization procedure and are likely to get trapped in local minima
Overview
In order to better capture the spatial structure of the depth data, our method aims at incorporating spatial-aware representations into the network inference. However, we also want to avoid performing pixel-wise estimation due to the computational overhead and lack of robustness. To that end, we adopt a pose re-parameterization process to encode an estimated pose into spatial-aware representations directly. Fig. 1 illustrates the architecture of our method. SSRN contains a feature extraction
Dataset and evaluation metric
We conduct experiments on four publicly available datasets: NYU dataset [1], ICVL dataset [23], MSRA dataset [24] and HANDS 2017 dataset [2].
NYU dataset [1] consists of 72K training and 8.2K testing depth images captured by the PrimeSense 3D sensor. The hand pose annotation contains 36 joints. Following previous works [6], [11], we select a subset of 14 joints for training and testing.
ICVL dataset [23] consists of 330 K training and 1.6 K testing depth images captured by
Conclusion
In this paper, we propose a Spatial-aware Stacked Regression Network (SSRN) for fast, robust and accurate 3D hand pose estimation from a single depth image. We utilize a differentiable pose re-parameterization module to efficiently generate a high-quality spatial-aware representation from the previously estimated pose. Taking such representations as input, a pose regression module allows the SSRN to utilize the 3D spatial structure of depth data and multi-joint spatial context to reevaluate the
CRediT authorship contribution statement
Pengfei Ren: Conceptualization, Methodology, Software, Writing - original draft, Investigation, Validation, Data curation, Formal analysis, Writing - review & editing. Haifeng Sun: Resources, Supervision, Project administration, Funding acquisition. Weiting Huang: Investigation, Validation, Writing - review & editing. Jiachang Hao: Resources, Formal analysis, Writing - review & editing. Daixuan Cheng: Writing - review & editing. Qi Qi: Resources, Supervision, Project administration, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under Grants 61671079, 61771068, and in part by the Beijing Municipal Natural Science Foundation under Grant 4182041. This work was also supported by BUPT Excellent Ph.D. Students Foundation CX2020121.
References (80)
- A survey on 3D hand pose estimation: Cameras, methods, and datasets, Pattern Recogn. (2019)
- J. Tompson, M. Stein, Y. Lecun, K. Perlin, Real-time continuous pose recovery of human hands using convolutional...
- Depth-based 3d hand pose estimation: From current achievements to future goals
- M. Oberweger, P. Wohlhart, V. Lepetit, Hands deep in deep learning for hand pose estimation, in: Proceedings of the...
- Region ensemble network: Improving convolutional network for hand pose estimation
- X. Chen, G. Wang, H. Guo, C. Zhang, Pose guided structured region ensemble network for cascaded hand pose estimation,...
- M. Oberweger, V. Lepetit, Deepprior++: Improving fast and accurate 3d hand pose estimation, in: Proceedings of the IEEE...
- M. Madadi, S. Escalera, X. Baró, J. Gonzalez, End-to-end global to local cnn learning for hand pose recovery in depth...
- M. Oberweger, P. Wohlhart, V. Lepetit, Training a Feedback Loop for Hand Pose Estimation, in: Proceedings of the IEEE...
- L. Ge, H. Liang, J. Yuan, D. Thalmann, 3D Convolutional Neural Networks for Efficient and Robust Hand Pose Estimation...
- V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation From a Single Depth Map
- Point-to-pose voting based hand pose estimation using residual permutation equivariant layer
- So-handnet: Self-organizing network for 3d hand pose estimation with semi-supervised learning
- Shpr-net: Deep semantic hand pose regression from point clouds, IEEE Access
- Dense 3d regression for hand pose estimation
- A2j: Anchor-to-joint regression network for 3d articulated pose estimation from a single depth image
- AWR: Adaptive Weighting Regression for 3D Hand Pose Estimation
- Handmap: Robust hand pose estimation via intermediate dense guidance map supervision
- Simple baselines for human pose estimation and tracking
- Latent regression forest: Structured estimation of 3d articulated hand posture
- Cascaded hand pose regression
- Realtime and robust hand tracking from depth
- Learning an efficient model of hand shape variation from depth images
- Fast and robust hand tracking using detection-guided optimization
- Robust articulated-ICP for real-time hand tracking, Computer Graphics Forum
- Motion capture of hands in action using discriminative salient points
- Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera
- Fast pose estimation with parameter-sensitive hashing
- Latent regression forest: structured estimation of 3d hand poses, IEEE Trans. Pattern Anal. Mach. Intell.
- Accurate, robust, and flexible real-time hand tracking
- Real-time joint tracking of a hand manipulating an object from rgb-d input
- Generative Model-Based Loss to the Rescue: A Method to Overcome Annotation Errors for Depth-Based Hand Pose Estimation
Pengfei Ren obtained his B.S. degree from Beijing University of Posts and Telecommunications in 2018, where he is currently working toward his M.S. degree. His research interests include deep learning, hand pose estimation and gesture recognition.
Haifeng Sun obtained his Ph.D. degree from Beijing University of Posts and Telecommunications in 2017. He is currently a lecturer at Beijing University of Posts and Telecommunications. His research interests include data mining, information retrieval, and Next Generation Networks.
Weiting Huang is a graduate student at Beijing University of Posts and Telecommunications. Her research interest is computer vision.
Jiachang Hao obtained his Bachelor's degree from Beijing University of Posts and Telecommunications in 2018. He is currently a postgraduate student at Beijing University of Posts and Telecommunications. His research interests include video understanding, action recognition and object detection.
Daixuan Cheng is a bachelor student under the supervision of Prof. Jingyu Wang in the Network Intelligence Research Center at Beijing University of Posts and Telecommunications. She has worked on various projects with Prof. Haifeng Sun. She is interested in natural language processing, computer vision, machine learning and data mining.
Qi Qi received her Ph.D. degree from Beijing University of Posts and Telecommunications in 2010, where she is currently an Associate Professor with the State Key Laboratory of Networking and Switching Technology. She has published more than 30 papers in international journals and has received two National Natural Science Foundation of China grants. Her research interests include ubiquitous services, deep learning, transfer learning, deep reinforcement learning, edge computing, and the Internet of Things.
Jingyu Wang was born in 1978 and obtained his Ph.D. degree from Beijing University of Posts and Telecommunications in 2008. He is currently a full professor with the State Key Laboratory of Networking and Switching Technology at Beijing University of Posts and Telecommunications, China. His research interests span broad aspects of the future internet, intelligent networks, machine learning and data mining. He has published hundreds of research papers and several books, and has been granted dozens of patents for inventions.
Jianxin Liao obtained his Ph.D. degree from the University of Electronic Science and Technology of China in 1996. He is currently the dean of the Network Intelligence Research Center and a full professor with the State Key Laboratory of Networking and Switching Technology at Beijing University of Posts and Telecommunications. He has published hundreds of research papers and several books, and has been granted dozens of patents for inventions. He has won a number of prizes, including the Premier's Award of Distinguished Young Scientists from the National Natural Science Foundation of China in 2005, and appointment as a Specially-invited Professor under the "Yangtse River Scholar Award Program" by the China Ministry of Education in 2009. His main creative contributions include mobile intelligent networks, service network intelligence, networking architectures and protocols, and multimedia communication. These achievements were conferred the "National Prize for Progress in Science and Technology" in 2004 and 2009.