Introduction

Human–robot interaction (HRI) based on a command line requires professionals to operate the robot, whereas HRI based on a graphical user interface has brought non-expert users considerable convenience. However, neither method meets the requirements of natural interaction, and both have hindered natural interaction between humans and robots. To address this problem, some researchers have attempted to introduce human communication modalities into human–computer interaction [1]. Hand gestures convey rich information and provide intuitive, natural, and effective interaction between human and robot. Hence, hand posture recognition and gesture-based interaction play significant roles in human–robot interaction and have attracted increasing attention [2].

Gesture recognition techniques can be divided into two categories, contact-based and vision-based [3, 4], depending on whether physical contact occurs between the user and the device. Contact-based hand gesture recognition is widely used for interaction via sensors such as electromyography, inertial measurement units, data gloves, and multi-touch screens [3,4,5]. Although contact-based systems achieve higher recognition rates and precision, users must wear a specific device while gestures are recognized. In contrast, vision-based hand gesture recognition uses information captured by cameras such as monocular, stereo, and color-depth (RGB-D) cameras; it provides a convenient and natural interface without any wearable devices [2, 3] and is therefore widely used [2, 6, 7].

Although depth cameras have been used in computer vision for several years, their high price and poor quality limited their application. The release of the low-cost RGB-D camera Kinect by Microsoft broadened the application of gesture recognition, because this RGB-D camera provides high-quality depth images. RGB-D cameras, such as the Microsoft Kinect and the ASUS Xtion PRO LIVE, provide RGB, depth, and skeleton information. Some researchers use only depth information [8], some use both RGB and skeletal information [9], and others utilize both RGB and depth information [10, 11] to recognize hand postures.

Generally, hand gestures are classified into static and dynamic gestures [12]. Hand shapes are static gestures, i.e., hand postures, while hand movements are dynamic gestures [13]. When conventional RGB cameras are used, recognizing a hand posture generally involves two steps: hand segmentation and hand posture recognition. Hand segmentation is the initial step, in which the image region containing the hand is detected and separated from the rest of the image. Features typically used for segmentation with conventional RGB cameras include skin color [14,15,16,17,18], shape [19, 20], both skin color and shape [21,22,23], motion [24], and a combination of motion, skin color, and edge information [25]. Motion-based methods include optical flow [26], frame difference [27], and background subtraction [28, 29]. Optical flow has a wide range of applications, but it is computationally complex and struggles to meet real-time requirements. The frame difference method detects moving objects from the difference between two consecutive images [30]. The background subtraction method applies to static background scenes; it runs in real time but adapts poorly to abrupt changes in the environment.
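To make the frame-difference idea concrete, the following minimal OpenCV sketch flags pixels that changed between two consecutive frames as candidate moving regions. The file names and the fixed threshold are illustrative only and are not part of the proposed method.

```python
import cv2

# Frame-difference sketch: pixels that change between consecutive frames
# are marked as candidate moving (e.g., gesturing) regions.
prev = cv2.cvtColor(cv2.imread("frame_t0.png"), cv2.COLOR_BGR2GRAY)  # hypothetical files
curr = cv2.cvtColor(cv2.imread("frame_t1.png"), cv2.COLOR_BGR2GRAY)

diff = cv2.absdiff(curr, prev)                       # per-pixel change magnitude
_, motion_mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
motion_mask = cv2.morphologyEx(motion_mask, cv2.MORPH_OPEN, kernel)  # remove speckle noise
cv2.imwrite("motion_mask.png", motion_mask)
```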

Owing to its stability and translation invariance, skin color is frequently used to segment the hand region. Moreover, hand detection/segmentation based on skin color is easy to implement and requires little computation. However, it is sensitive to variable illumination. In addition, skin-like objects in the background have colors quite similar to the hand, so hands cannot be segmented accurately from a skin-like background using skin color alone. In this situation, skin color can be combined with other cues, such as gradient, texture, and histogram information, to detect hands accurately. Hand detection based on shape information usually requires training a classifier on texture, histogram, edge, or gradient features, so it is not sensitive to variable illumination, but it usually involves higher computational complexity. Hand segmentation using both skin color and shape information reduces the computational cost and improves the detection rate and reliability compared with detection based on shape alone, but it is still influenced by variable illumination to some extent.

After the hand is segmented from the image, the hand posture/gesture is recognized, and several methods have been proposed for this step. Some researchers used hidden Markov models (HMM) [31], dynamic time warping (DTW) [32], and related methods to recognize hand gestures. Others employed machine learning methods to recognize hand postures, using histogram of oriented gradients (HOG) [33], local binary patterns (LBP) [34], HOG combined with means, variances, and Haar features [35], and AdaBoost classifiers [36,37,38,39]. Further studies used spatial histogram coding of nonsubsampled contourlet transform coefficients [23], Gabor features [40], local histograms [41], multiple kernels [42], saliency maps with Gabor and pyramid histogram of oriented gradients features [43], and support vector machines (SVM) [44, 45] to classify hand postures. These methods apply machine learning to train a model on data prepared in advance, and the trained model is then used to recognize hand postures; although training takes time, they classify effectively once the model has been trained. Furthermore, some researchers used template matching algorithms [46,47,48], and several studies applied geometric methods [15, 16, 49,50,51,52,53] to recognize hand postures. These methods are simple and require no pre-trained model; they often rely on shape information obtained from edge detection, which is, however, susceptible to noise, distortion, etc. Xiuhui Wang et al. [54] extracted the length and width of each finger, the number of fingers, the angle between the wrist and each finger, and the skin color as hand posture features, and recognized postures with an extended genetic algorithm. Based on their previous work, Yanqiu Liu et al. [55] added a set of concentric circles centered at the centre of the palm and extracted the number of concentric circles intersecting the outline of the segmented hand region as the posture feature; a linear discriminant analysis algorithm was then applied to these vectors, and a weighted k-nearest neighbour algorithm was developed for hand posture recognition. Malima et al. [16] proposed an efficient method that recognizes the hand posture by counting the number of zero-to-one (black-to-white) transitions along a circle centered at the centre of gravity (COG) of the hand region, with a radius equal to 70% of the farthest distance from the COG within the hand region. However, this method cannot distinguish different postures that have the same number of stretched fingers. Ju et al. [21] divided hand postures into two categories, a fist and an open palm, and recognized the open palm using the method of Malima et al. [16].

We found the work of Malima et al. [16] well suited to our application scenario of interaction between humans and our hexapod robot. However, it cannot be applied directly to our application, so it must be improved to fit our scenario; this is the starting point of our work.

This work is motivated by the study of Malima et al. [16], in which the authors classified hand postures whose different numbers of stretched fingers denote different postures. However, they did not consider postures that share the same number of stretched fingers yet denote different postures. We focus mainly on recognizing such postures and develop new hand shape distribution features to distinguish them, particularly in the cases where the same number of stretched fingers represents different hand postures.

To solve this problem and recognize hand postures effectively, we developed a method of hand posture recognition using low-level edge features. Because the segmented hand region is a closed area, polar coordinates are employed to describe the feature, with the centroid of the hand region as the pole. After the features are extracted from the hand regions, a multiclass SVM classifier is employed to recognize the hand posture.

In addition, based on the characteristics of our integrated leg–arm hexapod robot, we focus on tasks such as reconnaissance and rescue in public security applications. When reconnaissance tasks are performed, only hand postures are used to control the movements of the robot, because reconnaissance requires concealment. Furthermore, robots can execute known tasks in a structured environment, but unknown tasks in an unstructured environment, which public security personnel often face, cannot be fulfilled easily. In this case, human knowledge and experience are utilized: a supervised pattern and a demonstration pattern are combined, and a method that links the movement and manipulation of the robot is proposed based on hand posture recognition.

Our robot has multiple gaits, such as the “3 + 3”, “2 + 4”, and “1 + 5” gaits. Additionally, the robot has two manipulators, and different tools, such as a clamp and scissors, can be installed in their end effectors. Thus, multiple forms of interaction are required, and the interaction between the human and the robot needs to be natural. The difficulty in linking the robot's many movements and manipulations lies in designing an interaction system based on hand posture recognition that lets public security personnel perform tasks conveniently, naturally, and efficiently. To solve this problem, we combine a supervision pattern and a demonstration pattern and propose an interactive method for reconnaissance, rescue, and similar tasks. Specifically, the tasks a robot must perform are divided into two categories: regular tasks and complex tasks. Regular tasks, such as moving forward, backward, left, and right, are achieved by the robot autonomously, whereas complex tasks, such as unknown tasks in an unstructured environment, cannot be; a good solution is learning from demonstration. For example, hand postures are used to control the movements of the robot, and video captured by the camera installed on the robot is used to supervise its operations, ensuring that the robot performs the desired action.

The main contributions of the proposed method are listed as follows:

  1. To classify hand postures that have the same number of stretched fingers but represent different postures, this study presents a new hand shape distribution feature, motivated by the study of Malima et al. [16]. The experimental results demonstrate the effectiveness of the proposed feature.

  2. To reduce the effect of illumination, a new CbCr-I component Gaussian mixture model (GMM) is developed to detect skin regions. Furthermore, a new adaptive threshold is presented to reduce false detections and missed detections of skin pixels. In addition, a hand segmentation method using shape and position information is presented based on the CbCr-I component GMM. The segmentation results demonstrate the effectiveness of the CbCr-I component GMM, the adaptive threshold, and the proposed segmentation method.

The remainder of this paper is organized as follows: the next section describes the proposed approach in detail, including hand segmentation, hand shape distribution feature extraction and posture recognition. The subsequent section provides the related experiments and results. In the final section, the conclusions of our work are summarized.

Proposed approach

Users are usually not experts in robotics, and natural interaction between humans and robots is required. Because reconnaissance tasks are covert and dangerous, interaction must be silent, and hand postures are natural, intuitive, and non-verbal, so they are chosen for reconnaissance and rescue tasks in this case. In addition, knowledge gained and learned from humans is transferred to the proposed hand posture system to enhance HRI. Specifically, hand postures are regarded as graphics, and their mappings are regarded as knowledge representations. Moreover, hand postures are used to control the movements/manipulations of the leg–arm hexapod robot in this study.

Table 1 Mapping of hand postures to movements/manipulations of the hexapod robot
Fig. 1 Framework of the proposed approach

Fig. 2 Framework of the proposed hand segmentation approach

The proposed interaction system based on hand posture recognition is designed to enable the robot to perform reconnaissance, rescue, and counterterrorism tasks. First, several types of hand posture were designed according to everyday communication-related actions, the requirements of the tasks, and the characteristics of our robot, and a mapping from each hand posture to the corresponding motion/manipulation of the robot was predefined. Specifically, ten kinds of hand posture are designed: five control the movements of the robot (moving forward, backward, left, right, and stopping), and the remaining five control its manipulations (opening scissors, closing scissors, opening a clamp, closing a clamp, and changing the position of a manipulator). Part of the mapping is shown in Table 1. Once the robot has stopped and the posture that changes the position of a manipulator is recognized, the postures that normally control the movements of the robot are used instead to move the manipulator to a specific position, as sketched below. Images of the hand postures were then captured to form our data set, and the proposed method of hand posture recognition is used to train a model that recognizes them.
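The exact mapping is given in Table 1; the sketch below only illustrates how such a posture-to-command table and the mode switch described above could be represented in code. All class labels, command names, and the robot interface are hypothetical.

```python
# Hypothetical posture-to-command mapping (the real mapping is Table 1).
POSTURE_TO_COMMAND = {
    "posture_01": "move_forward",
    "posture_02": "move_backward",
    "posture_03": "move_left",
    "posture_04": "move_right",
    "posture_05": "stop",
    "posture_06": "open_scissors",
    "posture_07": "close_scissors",
    "posture_08": "open_clamp",
    "posture_09": "close_clamp",
    "posture_10": "switch_manipulator_position",
}

def dispatch(posture_label, robot, manipulator_mode=False):
    """Map a recognized posture to a robot command. Once the robot has
    stopped and the mode-switch posture has been seen, movement postures
    drive the manipulator instead of the body (hypothetical control flow)."""
    command = POSTURE_TO_COMMAND[posture_label]
    if manipulator_mode and command.startswith("move_"):
        command = command.replace("move_", "manipulator_")
    robot.execute(command)   # `robot` is an assumed interface, not part of the paper
```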

The proposed method of hand posture recognition consists of two stages, the training stage and the testing stage, as illustrated in the flowchart of Fig. 1. In the training stage, skin regions are detected with the CbCr-I component GMM and hand regions are then segmented; prior knowledge of position is used to distinguish the hand from the face. Shape features are extracted from the hand region and serve as input vectors to build a multiclass SVM classifier model, which is used to recognize hand postures in the testing stage.

Hand segmentation based on the CbCr-I component Gaussian mixture model

The hand region is segmented from the image in the hand segmentation step, which is the initial and essential step for hand posture recognition. The flow diagram of hand segmentation is shown in Fig. 2. First, the input image is pre-processed; pre-processing consists of bilateral filtering and transformation from normalized RGB to the YCbCr and YIQ color spaces. Bilateral filtering is applied to preserve edges while denoising. Subsequently, skin regions are detected using the proposed CbCr-I component GMM. Afterwards, the number of skin connected components at the bottom edge of the image is counted. If it is zero, the hand is segmented using the long-sleeve method; if it is greater than zero, the hand is segmented using the short-sleeve method.

Skin detection based on the CbCr-I component GMM

The first step in detecting skin regions in an image is choosing a color space. To reduce the dependence on lighting, the normalized RGB color space is applied [56]. The YCbCr color space can be used to detect skin because it clearly separates the chrominance and luminance components [57]. Moreover, the I-channel of the YIQ color space covers colors from orange to cyan, which makes it suitable for detecting skin pixels [58]. To detect skin pixels more accurately, both the Cb and Cr components of the YCbCr color space and the I component of the YIQ color space are used in this study to establish the skin color model.
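As an illustration of this color-space choice, the sketch below converts an image from normalized RGB to the Cb, Cr, and I components. It uses OpenCV for the YCbCr conversion and the standard NTSC matrix row for the I channel; it is an assumption about the implementation, not the authors' code.

```python
import cv2
import numpy as np

def cb_cr_i_features(bgr):
    """Return one (Cb, Cr, I) sample per pixel as an N x 3 float array,
    computed from the normalized RGB image (a sketch of the color model input)."""
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB).astype(np.float32)
    s = rgb.sum(axis=2, keepdims=True) + 1e-6
    nrgb = rgb / s                                   # normalized RGB, each channel in [0, 1]

    img8 = (nrgb * 255.0).astype(np.uint8)
    ycrcb = cv2.cvtColor(img8, cv2.COLOR_RGB2YCrCb)  # OpenCV channel order: Y, Cr, Cb
    cr, cb = ycrcb[..., 1], ycrcb[..., 2]

    r, g, b = nrgb[..., 0], nrgb[..., 1], nrgb[..., 2]
    i_comp = (0.596 * r - 0.274 * g - 0.322 * b) * 255.0  # NTSC I channel, rescaled

    return np.stack([cb.ravel(), cr.ravel(), i_comp.ravel()], axis=1).astype(np.float32)
```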

A Gaussian mixture probability density function is a weighted sum of individual Gaussian probability density functions, so a GMM is capable of approximating a complex distribution. When the user interacts with the robot through hand postures, the scene, position, and background are unknown. To detect the skin areas more accurately, a new CbCr-I component GMM is proposed, expressed as follows:

$$\begin{aligned}&P(X|\varTheta )=\sum _{i=1}^{k} \alpha _{i}N_{i}(X|\mu _{i},\varSigma _{i}) \end{aligned}$$
(1)
$$\begin{aligned}&\sum _{i=1}^{k}\alpha _{i}=1,0 \le \alpha _{i} \le 1 \end{aligned}$$
(2)
$$\begin{aligned}&\varTheta =\{\alpha _{1},\ldots ,\alpha _{k},\mu _{1},\ldots ,\mu _{k},\varSigma _{1},\ldots ,\varSigma _{k} \} \end{aligned}$$
(3)
$$\begin{aligned}&X=\{x_{1},\ldots ,x_{N}\}, \end{aligned}$$
(4)

where \(N_{i}(X|\mu _{i},\varSigma _{i})\) denotes the ith Gaussian function, and \(\mu _{i}\) and \(\varSigma _{i}\) are its mean and covariance matrix, respectively. \(\varTheta \) denotes the parameters of the GMM, k is the number of Gaussian probability density functions, and \(\alpha _{i}\) is the weight of the ith Gaussian probability density function. X is the set of pixel values, N is the number of pixels in the image, and \(x_{j} \) \(( 1\le j \le N )\) is the jth pixel value of the Cb, Cr, and I components, obtained by converting the image from the normalized RGB color space into the YCbCr and YIQ color spaces.

The expectation–maximization algorithm is adopted to estimate the GMM parameters, and the initial parameters are obtained by k-means clustering. The CbCr-I component GMM is trained on skin regions manually cropped from the hand posture images in our data set. To reduce computation time, three Gaussian components are used in this study.
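A minimal sketch of this training step, assuming the (Cb, Cr, I) skin samples have already been collected into an N × 3 array (the file name is hypothetical), uses scikit-learn's GaussianMixture, which fits the mixture by EM with k-means initialization:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# (Cb, Cr, I) samples from manually cropped skin regions; hypothetical file.
skin_samples = np.load("skin_cbcri_samples.npy")

skin_gmm = GaussianMixture(
    n_components=3,          # three Gaussian components, as in the paper
    covariance_type="full",
    init_params="kmeans",    # k-means initialization of the EM algorithm
    max_iter=200,
    random_state=0,
).fit(skin_samples)
```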

The skin region detection steps are shown in Fig. 2. First, the input image is pre-processed. Several pre-processing methods have been proposed, such as mean filtering, median filtering, and bilateral filtering. Sebastián Salazar-Colores et al. [59] presented an effective single-image dehazing method that modifies the dark channel prior and greatly reduces the recurrent artifacts produced by the ordinary dark channel. In this work, pre-processing consists of bilateral filtering and transformation from normalized RGB to the YCbCr and YIQ color spaces; bilateral filtering is applied to preserve edges while denoising. The probability that each pixel is a skin pixel is then calculated using the trained CbCr-I component GMM. A fixed threshold may cause false detections or missed detections, as skin pixel values are influenced by illumination, background, etc. To solve this problem, a new adaptive threshold is presented here.

The way humans detect skin motivated this part of the study. When a human looks for skin regions in an unknown image, the overall appearance of the image is assessed first, and skin regions are then examined in detail. We use this process as a reference for setting the adaptive threshold. Because the mean and middle values of the image, together with the difference between the high and mean values, describe the overall condition of the image to some extent, the adaptive threshold is defined as in Eq. (5).

$$\begin{aligned} t_\mathrm{{a}}=a t_{\mu }+b t_\mathrm{{m}}+c t_\mathrm{{s}}, \end{aligned}$$
(5)

where \(t_\mathrm{{a}}\) is the adaptive threshold, \(t_{\mu }\) is the mean value of the image, \(t_\mathrm{{m}}\) is the middle value of the image, \(t_\mathrm{{s}}\) is the difference between the high and mean values of the image, and a, b, c are empirical coefficients.

The probability that each pixel is a skin pixel is calculated using the trained CbCr-I component GMM, and the adaptive threshold then decides whether the pixel is skin. Morphological operations are applied, and the skin regions are obtained. We assume that when a human interacts with the robot through hand postures, neither the face nor the hand lies on the left, right, top, or bottom edge of the image; any connected components touching the top, left, or right edge are therefore removed.
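The sketch below shows one possible reading of this decision step: the GMM likelihood map is thresholded with Eq. (5) and cleaned with morphology. The coefficients a, b, c are placeholders, and the interpretation of the "middle" and "high" values as the median and maximum of the probability map is an assumption.

```python
import cv2
import numpy as np

def skin_mask_from_gmm(features, image_shape, skin_gmm, a=0.5, b=0.3, c=0.2):
    """Score each pixel's (Cb, Cr, I) vector with the trained GMM, threshold
    the resulting probability map with the adaptive value of Eq. (5), and
    clean the mask with morphological opening and closing (a sketch)."""
    prob = np.exp(skin_gmm.score_samples(features))   # per-pixel skin likelihood
    prob = prob.reshape(image_shape[:2])
    prob = prob / (prob.max() + 1e-12)                # normalize to [0, 1]

    t_mu, t_m = prob.mean(), np.median(prob)          # mean and "middle" values
    t_s = prob.max() - prob.mean()                    # "high" minus mean value
    t_a = a * t_mu + b * t_m + c * t_s                # Eq. (5)

    mask = (prob >= t_a).astype(np.uint8) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
```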

Hand region segmentation

After the skin regions are detected, the hand region is segmented from the image. We assume that while the user is gesturing, one hand is raised, held lower than the top of the head, and the other hand hangs down. We also assume that the captured image contains both the face and the gesturing hand, that the face is higher than the gesturing hand, and that the hand stays above the bottom edge of the image. When the user wears long sleeves, the number of skin regions touching the bottom edge is zero; when the user wears short sleeves, it is greater than zero. With long sleeves, the hand region is segmented using the relative positions of the face and hand, i.e., the hand is lower than the face; based on this assumption, the skin area immediately below the face (the second skin region from the top) is automatically taken as the hand region. With short sleeves, the gesturing arm is detected using prior position information, and the hand is cut off at the wrist using its shape features.

Although hand postures vary greatly because of the many hand joints, the shape of the arm remains basically unchanged. Branch points are present on the palm, whereas there are none on the arm. Moreover, the thickness of the forearm generally decreases from the elbow to the wrist, and the hand widens suddenly from the wrist to the palm. The wrist can therefore be detected from shape information, and the hand region is segmented at the wrist. The hand segmentation method is described in Algorithm 1, and a simplified sketch of the wrist-locating step follows.

Algorithm 1: Hand segmentation
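The following sketch illustrates only the wrist-locating idea behind Algorithm 1 under the stated shape assumptions (the gesturing arm enters from the bottom of the image with the fingers pointing up); it is not the full segmentation algorithm.

```python
import numpy as np

def find_wrist_row(arm_mask):
    """Scan the arm mask upward from the elbow: the width of each row shrinks
    toward the wrist and then widens abruptly at the palm. The wrist is placed
    at that abrupt widening (a hedged sketch, not Algorithm 1 itself)."""
    widths = (arm_mask > 0).sum(axis=1).astype(float)          # mask width per row
    rows = np.nonzero(widths)[0]
    smooth = np.convolve(widths, np.ones(9) / 9.0, mode="same")  # smooth the width profile

    # Difference taken while moving upward (toward smaller row indices):
    # a large positive value means the region suddenly widens (palm side).
    upward_jump = smooth[:-1] - smooth[1:]
    lo, hi = rows.min(), rows.max()
    wrist_row = lo + int(np.argmax(upward_jump[lo:hi]))
    return wrist_row
```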

Hand shape distribution feature

Motivation

Some researchers employed HOG [45], HOG and LBP [38], and SURF with Hu moment invariant features [60] to recognize hand postures. These features generally have high computational complexity and consume considerable CPU time, because they contain significant redundant information. Hand shapes convey the semantic meaning of hand postures, and hand contours are essential for representing them: each contour of a hand posture represents exactly one posture class, so hand contours can represent hand postures unambiguously. As shown in Fig. 3, hand postures can be recognized from hand contours alone, without other information such as texture or gradient. Furthermore, pixel-level features such as pixel values, texture, and gradient must be computed at every pixel of the image and therefore require significant computation. In this study, the hand contour is adopted to recognize the hand posture, as it can be computed without any other information.

Fig. 3 Examples of hand postures. a Hand postures with color, texture, gradient information, etc. b Hand postures with only hand contour information

This work was motivated by the study of Malima et al. [16] and presents a new hand shape feature based on hand contour information to distinguish hand postures, especially when postures have the same number of stretched fingers yet represent different postures. Hence, the main problem addressed here is the recognition of different hand postures that have the same number of stretched fingers.

Compared with the previously published drawing-circle method [16], the innovations of this work are as follows:

  1. The drawing-circle method [16] mainly uses distance and/or angle as features. In contrast, this work is motivated by the shape context descriptor: the centroid of the segmented hand region is taken as the pole of a polar coordinate system, the main direction of the segmented hand region is used as the reference direction, and a ray from the pole in the reference direction serves as the polar axis. Furthermore, this work uses two-dimensional matrices related to the shape context of hand contours as features.

  2. The drawing-circle method [16] classified hand postures in which different numbers of stretched fingers denote different postures; it did not address postures that share the same number of stretched fingers yet represent different postures. This study focuses mainly on the latter case and develops new hand shape features.

  3. The features proposed in this study differ from those of the drawing-circle method [16], which traced the constructed binary circle and used its pixel values as features. In this study, hand contour information is adopted: the distances between the contour points and the centroid are calculated, and a two-dimensional matrix related to the shape context of the hand contour is used to denote the hand shape feature.

  4. The drawing-circle method [16] counted the number of zero-to-one (black-to-white) transitions and subtracted one (for the wrist) to obtain the number of stretched fingers. In this study, a multiclass SVM classifier is applied to recognize the hand posture.

Definition of the hand shape distribution feature (HSDF)

Hand contours can represent hand postures; thus, the hand shape distribution feature (HSDF) is introduced to recognize hand postures in this study, as shown in Algorithm 2.

Algorithm 2: Hand shape distribution feature (HSDF) extraction

Hand regions are segmented using the skin color model described above; the segmented hand region and the detected skin region are shown in Fig. 4a and b, respectively. The edges of the segmented hand region are detected using the Canny detector [61], and the resulting hand contours are closed curves, as shown in Fig. 4c. Cartesian coordinates are generally not suitable for representing a closed curve, whereas polar coordinates are naturally suited to it. Moreover, after the samples are normalized to a fixed feature length, a rotation-invariant feature can be computed in polar coordinates. Therefore, the features are described in polar coordinates.

We adopt the radius as the parameter that represents the hand contour. To calculate the radius of each contour point, the origin on the hand region must be set first. In our case, the centroid of the segmented hand region is adopted as the origin, because this point lies on the palm and is therefore almost at the centre of the hand. The centroid of the hand region, denoted \((x_0, y_0)\), is given by Eqs. (6)–(9).

$$\begin{aligned} x_0= & {} \frac{M_{10}}{M_{00}}, y_0=\frac{M_{01}}{M_{00}} \end{aligned}$$
(6)
$$\begin{aligned} M_{00}= & {} \sum _{i=1}^{N}\sum _{j=1}^{M}P(i,j) \end{aligned}$$
(7)
$$\begin{aligned} M_{10}= & {} \sum _{i=1}^{N}\sum _{j=1}^{M}i*P(i,j) \end{aligned}$$
(8)
$$\begin{aligned} M_{01}= & {} \sum _{i=1}^{N}\sum _{j=1}^{M}j*P(i,j), \end{aligned}$$
(9)

where \(P(i,j)\) is the value of pixel \((i,j)\).

The red point in Fig. 4c is the centroid of the hand region, which is almost at the centre of the hand. The light blue curve depicts the contour of the hand region, and it can be seen clearly that the contour roughly represents the hand posture. The distances between the centroid and the points on the hand contour are calculated as the radii of the contour points by Eq. (10).

$$\begin{aligned} R_{x,y}=\sqrt{(x_0-x)^2+(y_0-y)^2}. \end{aligned}$$
(10)
Fig. 4 Segmented hand region and its corresponding features. a Segmented hand region. b Skin detection result of segmented hand region. c Hand contour. d Distances between the centroid and the points on the hand contour

First, the distances between the centroid and the points on the contour are calculated clockwise. As shown in Fig. 4, the profile of the curve in Fig. 4d corresponds roughly to the hand contour in Fig. 4c, and the crests of the curve correspond to the fingertips of the contour.
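A compact sketch of this feature computation is given below. It takes the hand contour directly from the binary mask with cv2.findContours instead of the Canny step described above, and resamples the radius curve to a fixed length of 60 (matching the feature dimension reported in the experiments); it is a simplified reading of the HSDF, not the authors' implementation.

```python
import cv2
import numpy as np

def hand_shape_distribution(hand_mask, length=60):
    """Centroid of the hand region (Eqs. 6-9) as the pole, distances from the
    centroid to the contour points (Eq. 10), and the radius curve resampled
    to a fixed length (a sketch of the HSDF)."""
    m = cv2.moments(hand_mask, binaryImage=True)
    x0, y0 = m["m10"] / m["m00"], m["m01"] / m["m00"]          # centroid (x0, y0)

    contours, _ = cv2.findContours(hand_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea).reshape(-1, 2)

    # Radius of every contour point with respect to the centroid, Eq. (10).
    radii = np.sqrt((contour[:, 0] - x0) ** 2 + (contour[:, 1] - y0) ** 2)

    # Normalize the curve length so hands of different sizes are comparable.
    src = np.linspace(0.0, 1.0, num=len(radii))
    dst = np.linspace(0.0, 1.0, num=length)
    return np.interp(dst, src, radii)
```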

Selection of the origin

The selection of the origin is important for describing the features in our method, since it affects the representation of the hand contour features to a large extent. However, the proposed method is robust to the origin selection as long as the selected origin lies on the palm.

Figure 5 shows four choices of origin and the corresponding radius curves. The number of peaks and their locations in the curves do not change even when the origin locations differ. Hence, the features of the hand posture do not change with the origin location, which means the proposed feature extraction method is robust to changes of the origin. This is reflected in the consistent tendency of the curves: the feature values at the fingertips remain local maxima, i.e., the number of peaks and their locations remain unchanged.

Fig. 5 Hand contours and their features

Posture recognition

To deal with the small-sample classification problem, a multiclass SVM classifier is employed to recognize the ten classes of hand postures. The proposed hand posture recognition method involves two stages, the training stage and the testing stage. In the training stage, skin regions are detected using the CbCr-I component GMM, after which morphological operations, prior position information, and shape information are used to segment the hand region, which contains only the detected hand posture. The proposed features are then extracted from the hand regions and fed into the multiclass SVM as input vectors to build the classifier model. In the testing stage, this model is used to classify the hand posture. SVM is a binary classifier and is extended here to recognize m classes of postures using the one-against-one strategy, i.e., \(m(m-1)/2\) classifiers are trained. The radial basis function is chosen as the kernel function in this paper.
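A minimal sketch of this classification stage with scikit-learn (whose SVC uses the one-against-one strategy internally) is shown below; the random arrays only stand in for the real feature vectors and posture labels, and the hyperparameter values are placeholders.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: 360 feature vectors of length 60, ten posture classes.
rng = np.random.default_rng(0)
X_train = rng.random((360, 60))
y_train = rng.integers(0, 10, size=360)

clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", decision_function_shape="ovo", C=10.0, gamma="scale"),
)
clf.fit(X_train, y_train)
print(clf.predict(X_train[:5]))   # predicted posture labels for five samples
```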

Because hand sizes differ, the lengths of the hand contours also differ, which may cause recognition to fail. To deal with this problem, the hand contours are resized to a fixed length. Let \(L_\mathrm{{o}}\) and L denote the length of the original hand contour and the length of the final hand contour, respectively. If \(L_\mathrm{{o}}>L\), the signal is re-sampled to the final length; if \(L_\mathrm{{o}}<L\), linear interpolation is used to resize the contour. The sampling step S is defined by Eq. (11). If \(S(n-1)\) is not an integer, the final hand contour is computed by Eq. (12); if \(S(n-1)\) is an integer, the interpolated value is \(l_\mathrm{{r}}^n=l_\mathrm{{o}}^{S(n-1)+1}\).

$$\begin{aligned} S= & {} \frac{L_\mathrm{{o}}-1}{L-1} \end{aligned}$$
(11)
$$\begin{aligned} l_\mathrm{{r}}^n= & {} \frac{\lceil S(n-1)+1\rceil -(S(n-1)+1)}{\lceil S(n-1)+1\rceil -\lfloor S(n-1)+1 \rfloor } l_\mathrm{{o}}^{\lceil S(n-1)+1\rceil }\nonumber \\&+ \frac{(S(n-1)+1)-\lfloor S(n-1)+1\rfloor }{\lceil S(n-1)+1\rceil -\lfloor S(n-1)+1 \rfloor } l_\mathrm{{o}}^{\lfloor S(n-1)+1\rfloor }, \end{aligned}$$
(12)

where \(l_\mathrm{{r}}^n\) is the nth point on the resulting curve, \(l_\mathrm{{o}}^n\) is the nth point on the original curve, \(\lceil *\rceil \) denotes the ceiling of the fractional number \(*\), and \(\lfloor *\rfloor \) denotes its floor.
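The length normalization can be sketched as follows, reading Eq. (12) as ordinary linear interpolation between the two original samples adjacent to each target position; the function name and the explicit loop are illustrative.

```python
import numpy as np

def resize_contour(l_o, L):
    """Resize a radius curve to L points: each output point n falls at the
    1-based position S*(n-1)+1 in the original curve (Eq. 11) and blends the
    two neighbouring samples (Eq. 12); when the position is an integer the
    sample is copied directly."""
    l_o = np.asarray(l_o, dtype=float)
    S = (len(l_o) - 1) / (L - 1)                 # Eq. (11)
    out = np.empty(L)
    for n in range(1, L + 1):
        pos = S * (n - 1) + 1                    # position in the original curve
        f, c = int(np.floor(pos)), int(np.ceil(pos))
        if f == c:                               # S(n-1) is an integer: copy the sample
            out[n - 1] = l_o[f - 1]
        else:                                    # blend the two neighbouring samples
            out[n - 1] = (c - pos) * l_o[f - 1] + (pos - f) * l_o[c - 1]
    return out
```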

Fig. 6 Several sample images of our data set

Fig. 7 Skin region detection and hand region segmentation in the conference room. a, d Original images. b, e Results of skin region detection based on the CbCr-I component GMM. c, f Results of hand region segmentation

Fig. 8 Skin region detection and hand region segmentation in the lab. a, d Original images. b, e Results of skin region detection based on the CbCr-I component GMM. c, f Results of hand region segmentation

Fig. 9 Skin region detection and hand region segmentation outdoors. a, d Original images. b, e Results of skin region detection based on the CbCr-I component GMM. c, f Results of hand region segmentation

Fig. 10 Hand region segmentation of a hand posture image with the user wearing short sleeves. a Original images. b Segmentation results of gesturing arm. c Center line of the gesturing arm. d Center line of the forearm. e Horizontal distance curve. f Gradient of the horizontal distance curve. g Approximate location of the wrist. h Results of hand region segmentation

Fig. 11 Skin region detection and hand region segmentation of hand posture images with the user wearing short sleeves. a, d Original images. b, e Segmentation results of gesturing arm. c, f Results of hand region segmentation

Table 2 Confusion matrix for ten hand postures using our dataset in the conference room
Table 3 Confusion matrix for ten hand postures using our dataset in the lab
Table 4 Confusion matrix for ten hand postures using our dataset outdoors

Experimental results

A dataset was constructed to evaluate the performance of the proposed hand posture recognition method. The dataset consists of 1200 images of ten posture classes captured in three scenes: a conference room, a laboratory, and outdoors. Four hundred hand posture images were captured in each scene, covering the ten posture types performed with both the left and the right hand; each class of each hand in each scene contains 20 images, and each posture varies in position, distance, and rotation. Because we want to use hand postures to control the movements of our multifunctional mobile robot, the postures were captured with a laptop camera that can be used to remotely operate the robot. The image size in our data set is \(1290\times 720\) pixels. We designed the hand posture recognition system from a human-centred viewpoint. When the user interacts with the robot through hand postures, the user faces the camera, can see the captured image, and gestures in whatever way feels natural and comfortable. If the face and the gesturing hand overlapped in the captured image, for example if the gesturing hand blocked the user's eyes, the user would feel uncomfortable, so it is assumed that the user's hand is kept away from the face in the captured image. Users generally turn their palm toward the camera while gesturing, so it is also assumed that the palm of the posture faces the camera. Some of the images in our dataset are shown in Fig. 6.

Table 5 Comparisons of our approach with other methods for ten hand postures in the conference room

Hand segmentation

Hand region segmentation of hand postures with long sleeves

After the skin regions are detected using the skin color model described above, the prior knowledge that the face is higher than the gesturing hand is employed; we assume the gesturing hand is lower than the face, as is true in most cases. Based on this assumption, the skin area immediately below the face (the second skin region from the top) is automatically taken as the detected hand region; in other words, the hand region is segmented as a region of interest (ROI) from the image. The results of hand segmentation are shown in Figs. 7, 8, and 9. The segmentation algorithm performs well, demonstrating the effectiveness of the proposed method.
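One possible reading of this selection step is sketched below: skin connected components are sorted by their top coordinate, the topmost is treated as the face, and the next one is returned as the hand region. The component ordering is an assumption based on the description above.

```python
import cv2
import numpy as np

def pick_hand_component(skin_mask):
    """Long-sleeve ROI selection sketch: among the skin connected components,
    the topmost one is assumed to be the face and the next one below it is
    returned as the gesturing hand."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(skin_mask, connectivity=8)
    # Drop the background label (0) and sort components by their top coordinate.
    comps = sorted(range(1, n), key=lambda i: stats[i, cv2.CC_STAT_TOP])
    if len(comps) < 2:
        return None                       # face or hand missing
    hand_label = comps[1]                 # second component from the top
    return (labels == hand_label).astype(np.uint8) * 255
```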

Hand region segmentation of hand postures with short sleeves

The proposed hand segmentation algorithm is also employed to segment hand postures with short sleeves. Figure 10 shows the hand region segmentation results for a short-sleeved hand posture image.

Table 6 Comparisons of our approach with the other methods for ten hand postures in the lab
Table 7 Comparisons of our approach with the other methods for ten hand postures outdoors
Table 8 Training time of different methods for hand postures in different scenes

The proposed segmentation algorithm was used to segment the ten types of hand postures with short sleeves, and some segmentation results are shown in Fig. 11. The proposed method can segment the hand region from images of users wearing either long or short sleeves, which demonstrates the effectiveness of the proposed segmentation algorithm.

Multiclass hand posture recognition

The presented method was evaluated on our dataset, which comprises ten types of hand posture images in the conference room, the lab, and outdoors, 1200 images in total. Each scene contains the ten posture types, and each type consists of 20 left-hand and 20 right-hand images. For each posture type in each scene, 18 left-hand and 18 right-hand images are used to train the multiclass SVM classifier, and the remaining two images per hand are used to test the trained model. Hence, 360 images are used for training and 40 images for testing in each scene.

The process of hand posture recognition consists of training and testing stages, as shown in Fig. 1. The training procedure is described as follows. First, the hand region is segmented from the image. Then, the proposed features are extracted from the hand region, and sent as input vectors to the multiclass SVM classifier to build the model. In the testing stage, after the hand region is segmented from the image, the features are extracted from the segmented image that contains the hand posture only. Finally, the features are fed into the multiclass SVM classifier model to recognize the hand posture.

Experiments are performed using tenfold cross validation on a computer with an Intel® Xeon® Gold 6254 CPU @ 3.10 GHz. The performance of hand posture recognition in the conference room, the lab, and outdoors is shown in Tables 2, 3, and 4, respectively. The proposed algorithm performs well, with average recognition accuracies of 92.75%, 91.75%, and 93.25%, respectively, which demonstrates the effectiveness of the presented approach.

Hand postures such as those in Fig. 6e, g, and l all have two stretched fingers but depict different postures, and they can be recognized by the proposed method. They cannot, however, be recognized by the method that counts the number of zero-to-one (black-to-white) transitions along a binary circle centered at the COG of the hand region with a radius equal to 70% of the farthest distance from the COG [16]. Moreover, hand postures with the same number of stretched fingers that denote different postures were recognized by the proposed method in the conference room, in the lab, and outdoors, which demonstrates the effectiveness of the proposed hand posture recognition approach.

To evaluate the performance of the proposed method, comparison experiments were carried out; the results are shown in Tables 5, 6, 7, and 8. The average accuracies of the proposed method in the conference room, lab, and outdoors are 92.75%, 91.75%, and 93.25%, respectively, whereas those of the HOG and SVM method [44] are 93.5%, 92.75%, and 95.5%. The HOG and SVM method [44] and the proposed method thus have similar recognition accuracy, but the proposed feature has a dimension of \(1\times 60\) rather than the \(1\times 20736\) of HOG, so the proposed method requires significantly less computation. Compared with LeNet-5 [62], whose average accuracies in the three scenes are 82.75%, 80.50%, and 86.25%, the proposed method is more accurate in most cases, perhaps because the number of samples is small. Compared with ResNet-18 [63], whose average accuracies are 95.5%, 96.5%, and 93.0%, the proposed method achieves similar recognition accuracy with a lower training time.

Conclusions

A novel method of hand posture recognition based on a new hand shape distribution feature and CbCr-I component GMM skin color detection has been presented. To reduce the effect of variable illumination, the CbCr-I component GMM is proposed, and in the segmentation step the hand region is segmented as a region of interest using this model and the adaptive threshold; hand regions can be segmented from images of users wearing either long or short sleeves. Subsequently, a new hand shape distribution feature based on the hand contour described in polar coordinates is proposed in the recognition step. This feature uses only hand contour information to represent hand postures and is robust to the location of the origin, provided the origin lies on the palm of the hand. Finally, a multiclass SVM classifier is employed to recognize hand postures. Because the hand contours are closed curves, the proposed method avoids the false detection problem of some shape-based methods. To evaluate the performance of the proposed method, we built a dataset and conducted hand posture recognition experiments. The experimental results show the effectiveness of the proposed method and demonstrate that the algorithm can recognize hand postures, particularly in cases where the same number of stretched fingers represents different postures.