A multi-scale descriptor for real time RGB-D hand gesture recognition

doi:10.1016/j.patrec.2020.11.011

Pattern Recognition Letters

Volume 144, April 2021, Pages 97-104

Highlights

• A new RGB-D shape-based hand gesture recognition method is proposed.
• A new hand shape descriptor is proposed with emphasized finger features.
• Different recognition engines are explored and compared for different applications.
• State-of-the-art accuracy is achieved on several benchmarks, together with excellent efficiency.
• A demo video is given for a real-life application of human-computer interaction.

Abstract

The development of depth cameras, e.g., the Kinect sensor, provides new opportunities for human-computer interaction (HCI). Although the Kinect sensor has been extensively applied to human tracking, human action recognition and hand gesture recognition, real time hand gesture recognition is still a challenging problem. In this paper, a new real time hand gesture recognition method is proposed. Since fingers are the most important clue for hand gesture classification, a finger-emphasized multi-scale descriptor is proposed. The proposed descriptor incorporates three types of parameters at multiple scales to form a discriminative representation of the hand shape. Furthermore, the features of fingers are emphasized for hand gesture analysis. Three solutions to hand gesture recognition are then investigated using DTW, SVM, and a neural network. Extensive experiments are conducted and the results show that the proposed method is robust to noise, articulations and rigid transformations. The comparison with state-of-the-art methods verifies the accuracy and efficiency of our method.

Introduction

Hand gesture recognition has been an important topic in computer vision because of its extensive applications in human-computer interaction (HCI), including virtual reality, sign language recognition and computer games [1]. Adopting hand gestures as an interface allows communication and manipulation in non-contact environments. Traditional methods attach sensors or markers to the fingers, e.g., data gloves [2], [3], to capture hand gestures via electro-mechanical or magnetic sensing. These methods are effective in providing complete and real-time measurements of hand gestures. However, they hinder the natural motion of the hand and are inapplicable in non-contact environments. Moreover, the devices are expensive for casual use and require complex calibration.

The vision-based hand gesture recognition methods [4], [5], [6], [7], [8] give an alternative solution to the problems, which can be used naturally in non-contact environments. However, due to the limitations of the optical sensors, the captured images are sensitive to lighting conditions and cluttered backgrounds. Thus these methods usually cannot detect and track the hand robustly. Therefore, the traditional vision-based methods are far from satisfactory for real-life applications.

With the development of depth cameras, e.g., the Kinect sensor [9], hand gesture recognition can be explored in a new form. The hand gesture can be detected and recognized by fusing the color image and the depth map. However, the hand occupies a small area of the image with significant articulation, noise and distortion, which affects the recognition result. The classic shape recognition methods, e.g., the shape-context-based methods [10], [11] and the skeleton-based methods [12], [13], cannot recognize hand gestures robustly under severe articulations and distortions. The part-based methods [14], [15] were proposed to solve these problems, but they cannot capture complete hand shape features for sufficient robustness and accuracy. Some recent contour-based methods [16] represent shapes with both local and global features in order to capture full shape information, but these methods are not efficient enough for real time applications. Using a depth sensor for real time hand gesture recognition therefore remains a challenging problem.

In this work, a new real time hand gesture recognition method is proposed. Inspired by the invariant multi-scale shape descriptor IMD [16], we propose a finger-emphasized multi-scale descriptor (FMD) for hand shape representation. Since the finger features are the most important clue for hand gesture classification, the shape features of fingers are emphasized in FMD, which makes the hand shape representation discriminative. The extracted invariant shape features are robust to rigid transformation, articulation and noise. Moreover, the multi-scale representation with both local and semi-global shape features makes the FMD a complete descriptor. The proposed hand pose representation can be combined with traditional classification methods for sequential data, such as Dynamic Time Warping (DTW), Support Vector Machine (SVM), and Back Propagation Neural Network (BPNN), for various applications.

Fig. 1 shows the framework of the proposed real time hand gesture recognition system. The Kinect sensor is used to capture both the color image and the depth map of hand gestures as input. The hand is detected and segmented from the cluttered background. The hand shape is then represented by the proposed FMD descriptor and classified by the recognition methods.

Extensive experimental results validate that our method is robust to noise, articulated variation and rigid transformation. The recognition accuracy is evaluated on the latest challenging hand gesture datasets [14], [17], [18], and our method outperforms state-of-the-art methods by a large margin. The mean running time of our method based on the BP neural network is less than 1 ms, which supports real time applications. Moreover, real time hand gesture recognition is implemented and the demo is shown in the supplementary material.

After a brief review of the related work in Section 2, the hand gesture detection is introduced in Section 3. Section 4 describes the FMD descriptor and the recognition engines are presented in Section 5. The experimental results and empirical evaluations are given in Section 6. This work is concluded in Section 7. This paper is an extension of the conference paper [19].

Section snippets

Related work

Various vision-based hand gesture recognition methods have been proposed in the literature [20], [21], [22], and most of them are summarized in Murthy and Jadon [6] and Erol et al. [23]. Generally, there are two main categories. One category consists of statistical-model-based methods, e.g., HMM models [20] and particle filtering [22]. The other category is based on a set of predefined rules [21]. However, most of the hand gesture recognition methods cannot operate well in cluttered environments. The color

Hand detection

In this work, we use the Kinect sensor as an input device to detect hands. The hand shape is detected from the RGB-D data. The hand position is located using the hand tracking function of the Kinect for Windows SDK. Then, the hand region is obtained by thresholding the depth image within a certain interval, as shown in Fig. 1. After detecting the hand region by RANSAC, the hand shape is segmented from the wrist. The contours of the segmented hand gestures are noisy and distorted, as the binary
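The depth-thresholding step described above can be illustrated with a minimal sketch. It assumes the depth map is a row-major grid of millimetre readings in which 0 marks an invalid pixel, and that the tracked hand position is available as a (row, column) pair. The function name `segment_hand_by_depth` and the default band of 100 mm are hypothetical choices for illustration, since the snippet does not state the interval actually used.

```python
def segment_hand_by_depth(depth_mm, hand_rc, band_mm=100):
    """Return a boolean mask of pixels whose depth is close to the hand depth.

    depth_mm: list of rows of depth readings in millimetres (0 = invalid).
    hand_rc:  (row, col) of the tracked hand position (assumed given).
    band_mm:  half-width of the accepted depth interval (assumed value).
    """
    row, col = hand_rc
    hand_depth = depth_mm[row][col]
    lo, hi = hand_depth - band_mm, hand_depth + band_mm
    # Keep valid pixels whose depth lies inside [lo, hi]; reject the rest.
    return [[d > 0 and lo <= d <= hi for d in line] for line in depth_mm]


if __name__ == "__main__":
    depth = [
        [2000, 2000, 2000],
        [2000, 800, 850],
        [0, 790, 2000],
    ]
    for mask_row in segment_hand_by_depth(depth, (1, 1)):
        print(mask_row)
```

The mask only isolates pixels near the hand depth. Separating the hand from the wrist and smoothing the contour are further steps that this sketch does not attempt.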

Hand gesture description

The representation of hand gestures should be invariant, robust, discriminative, and complete. For good classification performance, the intra-class distances are expected to be small, while the inter-class distances are expected to be large. To this end, the description needs to be complete, so that it takes all features into account and yields a discriminative representation. In this work, a finger-emphasized multi-scale descriptor (FMD) is proposed based on the invariant multi-scale

Hand gesture recognition

In this work, we explore the hand gesture recognition engines based on three different algorithms: the dynamic time warping (DTW) [32], Support Vector Machine (SVM), and the back propagation neural network (BP) [33], for various applications.

Since the representation of a hand gesture is a sequence of FMD parameters, a pairwise matching algorithm is an intuitive solution, and the DTW algorithm is a preferred choice for its ability to perform non-linear matching. Given two FMD sequences $I_{A}$ and $I_{B}$ with
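As an illustration of the non-linear matching mentioned above, the following is a minimal sketch of the textbook DTW recurrence applied to two sequences of feature vectors. It is not the paper's exact formulation: the Euclidean local cost, the absence of any starting-point alignment along the contour, and the names `dtw_distance` and `classify_by_dtw` are assumptions made for this example.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of equal-dimension vectors."""
    n, m = len(seq_a), len(seq_b)
    # acc[i][j] = minimal cost of aligning the first i items of seq_a
    # with the first j items of seq_b.
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: Euclidean distance (an assumption for this sketch).
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            acc[i][j] = local + min(acc[i - 1][j],      # skip in seq_b
                                    acc[i][j - 1],      # skip in seq_a
                                    acc[i - 1][j - 1])  # match both
    return acc[n][m]


def classify_by_dtw(query, templates):
    """Return the label of the template with the smallest DTW cost.

    templates: list of (label, sequence) pairs (hypothetical layout).
    """
    return min(templates, key=lambda item: dtw_distance(query, item[1]))[0]


if __name__ == "__main__":
    templates = [
        ("fist", [(0.0,), (0.0,)]),
        ("open", [(1.0,), (5.0,), (1.0,)]),
    ]
    query = [(1.0,), (4.5,), (4.5,), (1.0,)]
    print(classify_by_dtw(query, templates))
```

Because the warping path may repeat elements of either sequence, two descriptors of different lengths can still be compared, which is the property that makes DTW attractive for template matching.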

Experiments

In this section, we evaluate the capability of the proposed method in four aspects: (1) demonstrate the robustness of our method to noise, articulated variations and rigid transformations; (2) evaluate the accuracy and efficiency of our method on the challenging hand gesture datasets through an extensive comparative study; (3) test the performance of our method in a real time application of hand gesture recognition; (4) verify that the proposed FMD descriptor can be used with different classifiers

Conclusions

In this work, we propose a hand gesture recognition system using the Kinect sensor. A finger-emphasized multi-scale descriptor is proposed for hand gesture representation, which is robust to noise, hand articulations and rigid transformations. We implement three hand gesture recognition methods using DTW, SVM, and BP, respectively. Extensive experiments on the benchmark hand gesture datasets validate the robustness, accuracy and efficiency of our method. The proposed descriptor can be flexibly

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC No. 61773272), the Six Talent Peaks Project of Jiangsu Province, China (No. XYDXX-053), and the Suzhou Key Industry Technology Innovation-Prospective Application Research Project, Jiangsu, China (Grant No. SYG201711).

References (36)

C. Chua et al., Model-based 3D hand posture estimation from a single 2D image, Image Vis. Comput. (2002)
J. Cheng et al., Feature fusion for 3D hand gesture recognition by learning a shared hidden space, Pattern Recognit. Lett. (2012)
F. Dominio et al., Combining multiple depth-based descriptors for hand gesture recognition, Pattern Recognit. Lett. (2014)
J. Yang et al., Invariant multi-scale descriptor for shape representation, matching and retrieval, Comput. Vis. Image Underst. (2016)
A. Erol et al., Vision based hand pose estimation: a review, Comput. Vis. Image Underst. (2007)
J. Yang et al., Parsing 3D motion trajectory for gesture recognition, J. Vis. Commun. Image Represent. (2016)
J.P. Wachs et al., Vision-based hand gesture applications, Commun. ACM (2011)
G. Dewaele et al., Hand motion from 3D point trajectories and a smooth surface model, Proc. ECCV (2004)
E. Foxlin, Motion tracking requirements and technologies, Handbook of Virtual Environment Technology (2002)
B. Stenger et al., Filtering using a tree-based estimator, Proc. of IEEE ICCV (2003)
G.R.S. Murthy et al., A review of vision based hand gesture recognition, Int. J. Inf. Technol. Knowl. Manag. (2009)
J. Shotton et al., Real-time human pose recognition in parts from single depth images, Proc. IEEE CVPR (2011)
S. Belongie et al., Shape matching and object recognition using shape contexts, IEEE Trans. Pattern Anal. Mach. Intell. (2002)
H. Ling et al., Shape classification using the inner-distance, IEEE Trans. Pattern Anal. Mach. Intell. (2007)
X. Bai et al., Path similarity skeleton graph matching, IEEE Trans. Pattern Anal. Mach. Intell. (2008)
K. Siddiqi et al., Hamilton-Jacobi skeletons, Int. J. Comput. Vis. (2002)
Z. Ren et al., Robust part-based hand gesture recognition using Kinect sensor, IEEE Trans. Multimedia (2013)
Z. Ren et al., Minimum near-convex shape decomposition, IEEE Trans. Pattern Anal. Mach. Intell. (2013)
