
Pattern Recognition Letters

Volume 144, April 2021, Pages 97-104

A multi-scale descriptor for real time RGB-D hand gesture recognition

https://doi.org/10.1016/j.patrec.2020.11.011

Highlights

  • A new RGB-D shape-based hand gesture recognition method is proposed.

  • A new hand shape descriptor that emphasizes finger features is proposed.

  • Different recognition engines are explored and compared for different applications.

  • State-of-the-art accuracy is achieved on several benchmarks, along with excellent efficiency.

  • A demo video shows a real-life human-computer interaction application.

Abstract

The development of depth cameras, e.g., the Kinect sensor, provides new opportunities for human-computer interaction (HCI). Although the Kinect sensor has been extensively applied to human tracking, human action recognition and hand gesture recognition, real-time hand gesture recognition remains a challenging problem. In this paper, a new real-time hand gesture recognition method is proposed. Since fingers are the most important cue for hand gesture classification, a finger-emphasized multi-scale descriptor is proposed. The descriptor incorporates three types of parameters at multiple scales to form a discriminative representation of the hand shape, with the finger features emphasized for hand gesture analysis. Three solutions to hand gesture recognition are then investigated, based on DTW, SVM, and a neural network. Extensive experiments show that the proposed method is robust to noise, articulation and rigid transformation. Comparison with state-of-the-art methods verifies the accuracy and efficiency of our method.

Introduction

Hand gesture recognition has been an important topic in computer vision because of its extensive applications in human-computer interaction (HCI), including virtual reality, sign language recognition and computer games [1]. Adopting hand gestures as an interface allows communication and manipulation in non-contact environments. Traditional methods attach sensors or markers to the fingers, e.g., data gloves [2], [3], to capture hand gestures via electro-mechanical or magnetic sensing. These methods provide complete, real-time measurements of hand gestures. However, they hinder the natural motion of the hand and are inapplicable in non-contact environments. Moreover, the devices are expensive for casual use and require complex calibration.

Vision-based hand gesture recognition methods [4], [5], [6], [7], [8] offer an alternative that can be used naturally in non-contact environments. However, due to the limitations of optical sensors, the captured images are sensitive to lighting conditions and cluttered backgrounds, so these methods usually cannot detect and track the hand robustly. As a result, traditional vision-based methods are far from satisfactory for real-life applications.

With the development of depth cameras, e.g., the Kinect sensor [9], hand gesture recognition can be explored in a new form. The hand gesture can be detected and recognized by fusing the color image and the depth map. However, the hand occupies a small area of the image and exhibits significant articulation, noise and distortion, which affects the recognition result. Classic shape recognition methods, e.g., the shape-context-based methods [10], [11] and the skeleton-based methods [12], [13], cannot recognize hand gestures robustly under severe articulation and distortion. The part-based methods [14], [15] were proposed to solve these problems, but they cannot capture complete hand shape features for sufficient robustness and accuracy. Some recent contour-based methods [16] represent shapes with both local and global features in order to capture full shape information, but these methods are not efficient enough for real-time applications. Using a depth sensor for real-time hand gesture recognition therefore remains a challenging problem.

In this work, a new real-time hand gesture recognition method is proposed. Inspired by the invariant multi-scale shape descriptor IMD [16], we propose a finger-emphasized multi-scale descriptor (FMD) for hand shape representation. Since finger features are the most important cue for hand gesture classification, the shape features of the fingers are emphasized in FMD, which makes the hand shape representation discriminative. The extracted invariant shape features are robust to rigid transformation, articulation and noise. Moreover, the multi-scale representation with both local and semi-global shape features makes the FMD a complete descriptor. The proposed hand pose representation can be combined with traditional classification methods, such as Dynamic Time Warping (DTW), Support Vector Machine (SVM), and Back Propagation Neural Network (BPNN), for various applications.

Fig. 1 shows the framework of the proposed real-time hand gesture recognition system. The Kinect sensor captures both the color image and the depth map of hand gestures as input. The hand is detected and segmented from the cluttered background. The hand shape is then represented by the proposed FMD descriptor and classified by the recognition methods.

Extensive experimental results validate that our method is robust to noise, articulated variation and rigid transformation. The recognition accuracy is evaluated on recent challenging hand gesture datasets [14], [17], [18], where our method outperforms state-of-the-art methods. The mean running time of our method based on the BP neural network is less than 1 ms, which supports real-time applications. Moreover, real-time hand gesture recognition is implemented, and a demo is shown in the supplementary material.

After a brief review of the related work in Section 2, the hand gesture detection is introduced in Section 3. Section 4 describes the FMD descriptor and the recognition engines are presented in Section 5. The experimental results and empirical evaluations are given in Section 6. This work is concluded in Section 7. This paper is an extension of the conference paper [19].

Related work

Various vision-based hand gesture recognition methods have been proposed in the literature [20], [21], [22], and most of them are summarized in Murthy and Jadon [6] and Erol et al. [23]. Generally, there are two main categories. One category comprises statistical-model-based methods, e.g., HMM models [20] and particle filtering [22]. The other category is based on a set of predefined rules [21]. However, most hand gesture recognition methods do not operate well in cluttered environments. The color

Hand detection

In this work, we use the Kinect sensor as an input device to detect hands. The hand shape is detected from the RGB-D data. The hand position is located using the hand tracking function of the Kinect for Windows SDK. Then, the hand region is obtained by thresholding the depth image within a certain depth interval, as shown in Fig. 1. After detecting the hand region by RANSAC, the hand shape is segmented from the wrist. The contours of the segmented hand gestures are noisy and distorted, as the binary
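The depth-thresholding step described above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function name `segment_hand_by_depth`, the 100 mm margins, and the convention that a depth value of 0 marks an invalid pixel are all assumptions made for the example, and the tracked hand position is passed in by the caller rather than obtained from the Kinect SDK.

```python
def segment_hand_by_depth(depth_mm, hand_rc, near_mm=100, far_mm=100):
    """Return a binary mask of pixels whose depth lies in an interval
    around the depth at the tracked hand position.

    depth_mm : 2-D list of depth values in millimetres (0 = invalid, assumed)
    hand_rc  : (row, col) of the tracked hand position (supplied by the caller)
    near_mm, far_mm : hypothetical interval margins, not taken from the paper
    """
    r, c = hand_rc
    hand_depth = depth_mm[r][c]
    lo, hi = hand_depth - near_mm, hand_depth + far_mm
    # Keep only valid pixels that fall inside the depth interval.
    return [[1 if (d > 0 and lo <= d <= hi) else 0 for d in row]
            for row in depth_mm]


if __name__ == "__main__":
    depth = [
        [0,   2000, 2000],
        [650,  700, 2000],
        [690,  720,  900],
    ]
    for row in segment_hand_by_depth(depth, hand_rc=(1, 1)):
        print(row)
```

A fixed interval around the tracked depth is the simplest choice; in practice the margins would need tuning to the sensor range and to how far the forearm extends behind the hand.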

Hand gesture description

The representation of hand gestures should be invariant, robust, discriminative, and complete. For good classification performance, the intra-class distances should be small, while the inter-class distances should be large. To this end, the description needs to be complete, so that it takes the full set of features into account and yields a discriminative representation. In this work, a finger-emphasized multi-scale descriptor (FMD) is proposed based on the invariant multi-scale
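The intra-class versus inter-class criterion mentioned above can be checked numerically for any descriptor. The sketch below is a generic illustration under stated assumptions: descriptors are treated as fixed-length numeric vectors compared with the Euclidean distance, and the function name `mean_class_distances` is hypothetical. It does not compute the FMD itself.

```python
import math
from itertools import combinations


def mean_class_distances(features, labels):
    """Return (mean intra-class distance, mean inter-class distance)
    over all unordered pairs of samples, using Euclidean distance."""
    intra, inter = [], []
    for (fa, la), (fb, lb) in combinations(zip(features, labels), 2):
        d = math.dist(fa, fb)
        (intra if la == lb else inter).append(d)

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(intra), mean(inter)


if __name__ == "__main__":
    feats = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0), (3.0, 5.0)]
    labs = ["fist", "fist", "palm", "palm"]
    intra, inter = mean_class_distances(feats, labs)
    print(f"intra={intra:.3f} inter={inter:.3f}")
```

A descriptor that separates the classes well should give a mean inter-class distance clearly larger than the mean intra-class distance.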

Hand gesture recognition

In this work, we explore hand gesture recognition engines based on three different algorithms, namely dynamic time warping (DTW) [32], the support vector machine (SVM), and the back propagation neural network (BP) [33], for various applications.

Since the representation of a hand gesture is a sequence of FMD parameters, a pairwise matching algorithm is an intuitive solution, and the DTW algorithm is a preferred choice for its ability to perform non-linear matching. Given two FMD sequences IA and IB with
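For readers unfamiliar with DTW, the pairwise matching idea can be sketched with the textbook dynamic-programming recurrence. This is a generic DTW, not the paper's exact formulation: the local cost between two FMD elements and any path constraints are not given in this excerpt, so the example uses scalar sequences with an absolute-difference cost, and the name `dtw_distance` and its `cost` parameter are assumptions.

```python
def dtw_distance(seq_a, seq_b, cost=lambda x, y: abs(x - y)):
    """Classic dynamic time warping distance between two sequences.

    cost is the local distance between one element of each sequence;
    the default (absolute difference) is a placeholder, not the paper's cost.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # acc[i][j] = minimal accumulated cost aligning seq_a[:i] with seq_b[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(seq_a[i - 1], seq_b[j - 1])
            acc[i][j] = c + min(acc[i - 1][j],      # repeat element of seq_b
                                acc[i][j - 1],      # repeat element of seq_a
                                acc[i - 1][j - 1])  # advance both sequences
    return acc[n][m]


if __name__ == "__main__":
    # The second sequence repeats one sample; warping absorbs the repeat.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because the warping path may repeat elements of either sequence, two contours sampled at slightly different rates can still be matched at low cost, which is the non-linear matching property referred to above. For vector-valued elements, a suitable `cost` (for example a Euclidean distance) can be passed in.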

Experiments

In this section, we evaluate the capability of the proposed method in four aspects: (1) demonstrate the robustness of our method to noise, articulated variation and rigid transformation; (2) evaluate the accuracy and efficiency of our method on challenging hand gesture datasets through an extensive comparative study; (3) test the performance of our method in a real-time hand gesture recognition application; (4) verify that the proposed FMD descriptor can be used with different classifiers

Conclusions

In this work, we propose a hand gesture recognition system using the Kinect sensor. A finger-emphasized multi-scale descriptor is proposed for hand gesture representation, which is robust to noise, hand articulation and rigid transformation. We implement three hand gesture recognition methods, based on DTW, SVM, and BP, respectively. Extensive experiments on the benchmark hand gesture datasets validate the robustness, accuracy and efficiency of our method. The proposed descriptor can be flexibly

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC No. 61773272), the Six Talent Peaks Project of Jiangsu Province, China (No. XYDXX-053), and the Suzhou Key Industry Technology Innovation-Prospective Application Research Project, Jiangsu, China (Grant No. SYG201711).

References (36)

  • G.R.S. Murthy et al., A review of vision based hand gesture recognition, Int. J. Inf. Technol. Knowl. Manag. (2009)
  • J. Shotton et al., Real-time human pose recognition in parts from single depth images, Proc. IEEE CVPR (2011)
  • S. Belongie et al., Shape matching and object recognition using shape contexts, IEEE Trans. Pattern Anal. Mach. Intell. (2002)
  • H. Ling et al., Shape classification using the inner-distance, IEEE Trans. Pattern Anal. Mach. Intell. (2007)
  • X. Bai et al., Path similarity skeleton graph matching, IEEE Trans. Pattern Anal. Mach. Intell. (2008)
  • K. Siddiqi et al., Hamilton Jacobi skeletons, Int. J. Comput. Vis. (2002)
  • Z. Ren et al., Robust part-based hand gesture recognition using kinect sensor, IEEE Trans. Multimedia (2013)
  • Z. Ren et al., Minimum near-convex shape decomposition, IEEE Trans. Pattern Anal. Mach. Intell. (2013)