Abstract

In this paper, a fusion method based on multiple features and a hidden Markov model (HMM) is proposed for recognizing the dynamic hand gestures that correspond to an operator's instructions in robot teleoperation. First, a valid dynamic hand gesture is segmented from the continuously acquired data according to the velocity of the moving hand. Second, a feature set is introduced for dynamic hand gesture expression, which includes four sorts of features: palm posture, finger bending angle, finger opening angle, and gesture trajectory. Finally, HMM classifiers based on these features are built, and a weighted calculation model fusing the probabilities of the four sorts of features is presented. The proposed method is evaluated on dynamic hand gestures acquired by Leap Motion (LM), and it reaches recognition rates of about 90.63% on the LM-Gesture3D dataset created for the paper and 93.3% on the letter-gesture dataset, respectively.

1. Introduction

Dynamic hand gesture recognition has been an intriguing problem in recent years; if solved efficiently, it could provide one of the richest means of communication available. Because of this, many scholars from all over the world have carried out extensive theoretical and practical research [1]. Compared with static gestures, dynamic gestures carry more abundant meaning and form a more common and natural way of interaction. At the same time, however, the information of a dynamic hand gesture, such as shape and location, varies with time, which increases the difficulty of recognition.

At present, there are two main types of sensors capable of sensing hand gestures: wearable sensors and vision-based sensors [2, 3]. The former can capture the movement of hands and fingers and extract sufficient hand information, but at the expense of convenience and cost: it places an additional burden on users and can feel unnatural when performing hand gestures. A vision-based sensor, by contrast, is less cumbersome and allows more natural interaction because it involves no physical contact with users. However, its computational complexity is quite high for hand detection, tracking, and feature extraction [4]. For instance, a hand must be separated from the background before the final recognition, a step that can be significantly affected by external environmental factors such as ambient light. Moreover, because of the complex 3D movements of hands and fingers, it is difficult to properly understand the performed hand pose from information extracted from 2D images [5]. Besides, when the palm surface is not parallel to the camera, for example, recognition becomes even harder.

Classification is a crucial step in recognizing hand gestures. Five main methods for classifying hand gestures based on 3D vision can be identified: support vector machines (SVMs), artificial neural networks (ANNs), template matching (TM), HMM, and dynamic time warping (DTW) [4]. The SVM is a popular classifier for hand gesture recognition, in which support vectors are used to determine the hyperplane that realizes the maximum separation of the hand gesture classes [6]. In vision-based hand gesture recognition systems, the ANN is used as a classifier to handle only fundamental and limited hand gestures [7]. When high-level discriminative 3D hand features are available, TM is an excellent choice for recognizing hand gestures and works quite well with contour- or boundary-based hand features [8]. As a hand gesture is a continuous pattern with respect to time, the HMM is found to be the most suitable pattern recognition tool for testing on a moderately large dataset [9]. DTW is an indirect continuous hand gesture recognition approach that automatically aligns sequences of different lengths and returns the proper distance [10].

Martin Sagayam and Jude Hemanth [11] develop a probabilistic model based on state sequence analysis in the HMM to recognize hand gestures taken from the Cambridge hand dataset. The experimental results show that the proposed method achieves a 0.98% reduction in error rate and a 1.55% improvement in recognition rate over Viterbi prediction. Some work combines HMM with other methods for gesture recognition. Zhou et al. [12] use HMM to model the different information sequences of dynamic hand gestures and use a BP neural network (BPNN) as a classifier to process the resulting hand gestures modeled by HMM, which achieves satisfactory real-time performance and an accuracy above 84%. Martin Sagayam and Jude Hemanth [13] propose a hybrid 1D HMM model with artificial bee colony (ABC) optimization. The method is evaluated on nine different classes of hand gestures used for virtual reality applications. The experimental results show that the average recognition rate with ABC optimization increases by 2.72%, and the average error rate decreases by 0.47%.

With the emergence and development of deep learning technology, some scholars have tried to apply it to hand gesture recognition. Oyedotun and Khashman [14] apply a convolutional neural network (CNN) and a stacked denoising autoencoder (SDAE) to recognize 24 American Sign Language (ASL) hand gestures obtained from a public database, achieving recognition rates of 91.33% and 92.83%, respectively. Bao et al. [15] propose a deep CNN that can classify hand gestures from the whole image without any segmentation or detection stage. The method can classify seven sorts of hand gestures in a user-independent manner and achieves an accuracy of 97.1% on the dataset with simple backgrounds and 85.3% on the dataset with complex backgrounds.

In recent years, 3D sensors, such as binocular cameras, Kinect, and LM, have been applied to hand gesture recognition with excellent performance. LM can detect and track hands and fingers with an accuracy of about 0.01 mm and feed back the gesture information in real time at a sampling rate of 120 fps [16]. Because of this superior performance, many researchers consider it a promising 3D sensor that is particularly suitable for hand gesture recognition. For instance, Chen et al. [17] extract directional codes of the 3D motion trajectory as the feature and exploit an SVM-based classifier to classify letter and number gestures. Ameur et al. [18] extract the positions of the fingertips and palm center as features, which are then trained with an SVM classifier; their method reaches an average recognition rate of about 81% on 11 kinds of dynamic gestures. Xu et al. [19] and Zeng et al. [20] also conducted similar studies. Besides, some researchers are working on dynamic gesture recognition. Lu et al. [21] build two kinds of features and feed them into a hidden conditional neural field classifier to recognize dynamic gestures. Avola et al. [22] propose long short-term memory (LSTM) and recurrent neural networks (RNNs) combined with an effective set of discriminative features based on both joint angles and fingertip positions to recognize sign language and semaphoric hand gestures, achieving an accuracy of over 96%. Vamsikrishna et al. [9] propose a low-cost computer-vision-assisted setup based on LM to detect precise movements of the palm or fingers within the field of view of the sensor and then present a set of discrete HMMs for classifying the gesture sequences performed during rehabilitation.

The paper is aimed at recognizing the hand gestures corresponding to an operator's hand commands in robot teleoperation. For this problem, the paper develops four feature vectors and their extraction models based on the 3D information acquired by LM to describe the hand gestures. The paper then establishes HMMs to calculate the occurrence probabilities of the four feature sequences of an unknown hand gesture, respectively. Lastly, the paper uses a weighted algorithm to fuse the occurrence probabilities of the four features, and the gesture with the largest fused probability is taken as the recognition result. The rest of the paper is organized as follows. Preliminary work on hand gesture recognition is introduced in Section 2. The methods of feature extraction are presented in Section 3, including valid dynamic gesture judgment, feature definition, and feature sequence clustering. The HMM training model and hand gesture recognition by fusing the feature probabilities are proposed in Section 4. Section 5 presents the experiments, results, and discussion. Conclusions and possible future extensions are given in Section 6.

2. Preliminary Work of Gesture Recognition

2.1. Leap Motion and Data Acquisition

LM mainly consists of three infrared LEDs and two infrared cameras, which take photos from different directions to obtain gesture information in 3D space [16]. LM has a field of view of about 150 degrees and an effective range of approximately 0.03 to 0.6 meters above the device. LM feeds back data frames that consist of the positions and velocities of key points, rotation information, and the frame timestamp.

When collecting gestures, LM establishes a right-hand coordinate system, as shown in Figure 1, for all obtained data such as the position, velocity, and posture of the human hand. As shown in Figure 1, the five fingertips and the palm center C are marked. We mainly focus on the following data: (1) the palm normal vector and the palm direction vector, i.e., the unit vector perpendicular to the palm plane and the unit vector pointing from the palm position toward the fingers, respectively; (2) the finger direction vector and the visible finger length, i.e., the unit vector pointing toward the fingertip and the extended length of the finger, respectively; (3) the instantaneous velocities of the five fingertips and the instantaneous velocity of the palm center; and (4) the coordinate of the palm center, which represents the palm position in frame t.
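For reference in the illustrative code sketches used later in the paper, the per-frame quantities above can be gathered into a simple container. The structure below is only an assumed convention for those sketches; the field names are hypothetical and are not the Leap Motion SDK's own identifiers.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HandFrame:
    """One LM data frame, holding the quantities used by the feature sketches below.
    Field names are illustrative, not the official Leap Motion SDK identifiers."""
    palm_normal: np.ndarray     # unit vector perpendicular to the palm plane, shape (3,)
    palm_dir: np.ndarray        # unit vector pointing from the palm toward the fingers, shape (3,)
    palm_pos: np.ndarray        # palm-center coordinate in the LM frame, shape (3,)
    palm_vel: np.ndarray        # instantaneous palm-center velocity, shape (3,)
    finger_dirs: np.ndarray     # per-finger direction unit vectors, shape (5, 3)
    finger_vis_len: np.ndarray  # visible (extended) length of each finger, shape (5,)
    tip_vel: np.ndarray         # instantaneous fingertip velocities, shape (5, 3)
    timestamp: float            # frame timestamp
```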

2.2. Dynamic Gesture Definition

There are relatively few publicly available hand gesture datasets created from LM-sampled data, especially for dynamic hand gestures in robot teleoperation. We analyze the movement characteristics of the operator's hand commands in robot teleoperation, such as translation and rotation in three degrees of freedom, and create a gesture dataset named LM-Gesture3D, which contains eight different dynamic gestures, as shown in Table 1. All these gestures collected by LM represent practical operations or command signs and can be performed easily and naturally. Besides, there are similarities among the gestures in some respects, which will be discussed in more detail later.

3. Feature Extraction

3.1. Valid Dynamic Gesture Judgment

Despite its many merits, LM mainly acts as a gesture data collector, similar to a wearable device or a camera. Hence, conditions for judging the beginning and the end of a valid dynamic gesture need to be given first. Taking LM-Gesture3D as an example, the fingertips and palm center inevitably produce rapid and continuous displacement when any of the gestures is performed; even the simplest dynamic gesture, a click for example, is no exception. Based on the above analysis, a simple discriminant, referred to as discriminant (1), is established on the instantaneous velocity v_C of the palm center and the instantaneous velocities v_F of the fingertips: the data of a frame are attributed to a gesture only if these velocities exceed a predefined velocity threshold v_th.

When v_C and v_F satisfy discriminant (1) for at least 60 consecutive frames, those data frames are regarded as the original data of a valid dynamic gesture.

Because LM is quite sensitive, discriminant (1) may be satisfied over a few consecutive frames when the hand shakes slightly at rest or when the obtained data contain noise. The minimum frame count (i.e., 60 frames) is therefore set to eliminate such useless data. In addition, a dynamic gesture performed at low speed will be judged as invalid by discriminant (1), which leaves a degree of freedom for ordinary hand movement.
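A minimal sketch of this segmentation rule, assuming the HandFrame container from Section 2.1. The threshold value and the exact way discriminant (1) combines the palm and fingertip velocities are not reproduced here, so the sketch simply triggers when any of these velocities exceeds an assumed threshold.

```python
import numpy as np

MIN_FRAMES = 60          # minimum number of consecutive fast frames (Section 3.1)
V_THRESHOLD = 300.0      # assumed velocity threshold in mm/s; the paper's value is not given here

def is_fast(frame) -> bool:
    """Discriminant (1), sketched: the palm center or any fingertip exceeds the threshold."""
    speeds = [np.linalg.norm(frame.palm_vel)]
    speeds += [np.linalg.norm(v) for v in frame.tip_vel]
    return max(speeds) > V_THRESHOLD

def segment_valid_gestures(frames):
    """Return lists of consecutive frames (length >= MIN_FRAMES) satisfying discriminant (1)."""
    gestures, current = [], []
    for frame in frames:
        if is_fast(frame):
            current.append(frame)
        else:
            if len(current) >= MIN_FRAMES:
                gestures.append(current)
            current = []
    if len(current) >= MIN_FRAMES:
        gestures.append(current)
    return gestures
```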

3.2. Feature Definition

To effectively recognize dynamic gestures, changes in hand posture and changes in hand position are analyzed separately. The former can be further divided into the bending angles of the fingers, the opening angles between the fingers, and the palm posture; the latter can be represented by the gesture trajectory. Therefore, the paper describes the changes in a gesture through these four features.

The specific extraction process and expression of the four features are as follows.

3.2.1. Palm Attitude Feature

If the palm shape changes little in a dynamic gesture, the change in palm posture can be regarded as the attitude-angle calculation problem of a rigid body. The paper draws on the 3D attitude measurement method presented in [23].

As shown in Figure 1, the palm posture in 3D space at any time can be uniquely determined by the palm normal vector and the palm direction vector. Together with their cross product, these two unit vectors define a coordinate system that represents the palm posture in frame t. We take the coordinate system of the initial data frame of the dynamic gesture as the fixed reference coordinate system. The change in palm posture between the current frame and the first frame can then be represented by three Euler angles.
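The sketch below shows one plausible realization of this computation: an orthonormal basis is built from the palm normal, the palm direction, and their cross product; the relative rotation with respect to the first frame is formed; and it is decomposed into Euler angles. The Z-Y-X angle convention and the re-orthogonalization step are assumptions, not the paper's own equation.

```python
import numpy as np

def palm_basis(frame) -> np.ndarray:
    """Orthonormal basis (3x3, rows are axes) built from the palm normal and direction vectors."""
    n = frame.palm_normal / np.linalg.norm(frame.palm_normal)
    d = frame.palm_dir / np.linalg.norm(frame.palm_dir)
    r = np.cross(n, d)                 # third axis completes the right-handed frame
    r /= np.linalg.norm(r)
    d = np.cross(r, n)                 # re-orthogonalize in case n and d are not exactly perpendicular
    return np.vstack([n, d, r])

def euler_zyx(R: np.ndarray):
    """Decompose a rotation matrix into Z-Y-X (yaw, pitch, roll) Euler angles in radians."""
    yaw = np.arctan2(R[1, 0], R[0, 0])
    pitch = np.arcsin(np.clip(-R[2, 0], -1.0, 1.0))
    roll = np.arctan2(R[2, 1], R[2, 2])
    return yaw, pitch, roll

def palm_posture_feature(frames):
    """Per-frame Euler angles of the palm relative to the first frame of the gesture."""
    B0 = palm_basis(frames[0])
    feats = []
    for f in frames:
        Bt = palm_basis(f)
        R = Bt.T @ B0                  # rotation carrying the initial palm axes onto the current ones
        feats.append(euler_zyx(R))
    return np.asarray(feats)           # shape (T, 3)
```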

3.2.2. Bending Angle of Fingers

Since we mainly focus on the bending angle of a finger, the thickness of the finger can be neglected, and each finger can then be simplified to a planar model, as shown in Figure 2. Based on these models, Hong et al. [24] propose a method to estimate the hand's attitude, or rather the bending angles of the fingers and the coordinates of the joint points. Under all conditions, their method requires merely the total length of the finger, the visible length of the finger, and several constraint constants. Drawing on their work, we define the finger bending angle in equation (3), where the visible length can be obtained directly from LM and equals the total length when the finger is straight.

In equation (3), the total finger length is used for normalization in order to make the approach robust to users with hands of different sizes. A simple method is proposed to calibrate the total lengths before data acquisition. The user keeps his/her palm plane parallel to LM and opens the fingers as straight as possible. When the data of at least 30 continuous frames satisfy the two calibration conditions, including the requirement on the component of the palm normal vector along the Y-axis direction of the LM coordinate system (which ensures that the palm is parallel to LM), the obtained visible lengths of the five fingers are recorded as the total lengths.
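As an illustration, the sketch below records the total finger lengths during the calibration phase and then computes a normalized bending measure from the ratio of visible length to total length. The concrete calibration threshold on the palm normal and the arccos form of the bending measure are assumptions standing in for the paper's equation (3) and its exact calibration conditions.

```python
import numpy as np

CALIB_FRAMES = 30      # number of consecutive calibration frames (Section 3.2.2)
NORMAL_Y_MIN = 0.95    # assumed threshold on |n_y|: the palm must be roughly parallel to LM

def calibrate_total_lengths(frames) -> np.ndarray:
    """Average the visible finger lengths over CALIB_FRAMES frames in which the palm
    normal is nearly vertical (palm parallel to LM) and the fingers are held straight."""
    good = [f for f in frames if abs(f.palm_normal[1]) > NORMAL_Y_MIN]
    if len(good) < CALIB_FRAMES:
        raise ValueError("not enough calibration frames with the palm parallel to LM")
    lengths = np.stack([f.finger_vis_len for f in good[:CALIB_FRAMES]])
    return lengths.mean(axis=0)            # total length of each of the five fingers

def bending_feature(frame, total_len: np.ndarray) -> np.ndarray:
    """Normalized bending measure per finger: 0 when straight, growing as the finger bends.
    The arccos form is an assumed stand-in for the paper's equation (3)."""
    ratio = np.clip(frame.finger_vis_len / total_len, 0.0, 1.0)
    return np.arccos(ratio)                # shape (5,)
```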

3.2.3. Opening Angle of Fingers

The other descriptor for the fingers is the opening angle between them. As mentioned above, every single finger can be modeled on a plane, so the problem of computing the angle between two fingers is converted into calculating the angle between two planes. Here, one plane is taken as the benchmark plane in the computation. Letting the normal vectors of the benchmark plane and of the finger planes be given, the opening angle is calculated from the angle between the corresponding normal vectors.
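A minimal sketch of the plane-angle computation: given the normal vector of the benchmark plane and the normal vector of a finger plane, the opening angle follows from the angle between the two normals. How the plane normals themselves are constructed from the LM data is not shown here and would follow the paper's definitions.

```python
import numpy as np

def angle_between_planes(n_ref: np.ndarray, n_finger: np.ndarray) -> float:
    """Angle (radians) between two planes, computed from their unit normal vectors."""
    n1 = n_ref / np.linalg.norm(n_ref)
    n2 = n_finger / np.linalg.norm(n_finger)
    cos_a = np.clip(np.dot(n1, n2), -1.0, 1.0)
    return float(np.arccos(cos_a))

def opening_feature(n_ref: np.ndarray, finger_plane_normals: np.ndarray) -> np.ndarray:
    """Opening angle of each finger plane with respect to the benchmark plane."""
    return np.array([angle_between_planes(n_ref, n) for n in finger_plane_normals])
```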

3.2.4. Trajectory Feature

A specific and meaningful trajectory usually accompanies some dynamic gestures, such as circling with a finger (like G5). The paper therefore considers the path of the dynamic gesture and extracts a simplified trajectory feature for gesture recognition. When LM works, it returns the spatial coordinates of the palm center with high accuracy and stability, so the moving trajectory of a hand can be expressed by a series of discrete points. The paper projects the gesture trajectory onto LM's principal gesture plane, i.e., the XOZ plane. The detailed feature extraction process is as follows (an illustrative code sketch is given after this description):

(1) Let the discrete points of the 2D gesture trajectory be given; the central point of these points is computed as their mean.

(2) Each trajectory point forms a vector with the central point as the starting point. The norm of this vector and the direction angle between the vector and the X-axis are then computed.

(3) The norms of the vectors are normalized by the maximum norm, and the direction angles are converted into codes according to the angular regions shown in Figure 3.

Before coding the direction angles, we change the coordinate system from the original LM one into the coordinate system shown in Figure 3(a), the axis of which always points from the central point to the first trajectory point. The obtained trajectory features, the normalized norms and the direction codes, are therefore scale and rotation invariant on the operation plane.
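A sketch of the trajectory coding just described, assuming the trajectory has already been projected onto the XOZ plane and that the angular regions of Figure 3 correspond to eight 45-degree sectors (the number of sectors is an assumption).

```python
import numpy as np

N_CODES = 8   # assumed number of angular regions in Figure 3 (45 degrees each)

def trajectory_feature(points_xz: np.ndarray):
    """points_xz: (T, 2) palm-center positions projected onto the XOZ plane.
    Returns the normalized vector norms and the direction codes of every trajectory point."""
    center = points_xz.mean(axis=0)                    # central point of the trajectory
    vecs = points_xz - center                          # vectors from the center to each point
    norms = np.linalg.norm(vecs, axis=1)
    norms_norm = norms / (norms.max() + 1e-12)         # scale normalization by the maximum norm

    # Rotate the frame so that its first axis points from the center to the first point,
    # which makes the direction codes invariant to rotations in the operation plane.
    ref = vecs[0] / (np.linalg.norm(vecs[0]) + 1e-12)
    rot = np.array([[ref[0], ref[1]],
                    [-ref[1], ref[0]]])                # maps ref onto the +X axis
    vecs_rot = vecs @ rot.T

    angles = np.arctan2(vecs_rot[:, 1], vecs_rot[:, 0]) % (2 * np.pi)
    codes = np.floor(angles / (2 * np.pi / N_CODES)).astype(int) + 1   # codes 1..N_CODES
    return norms_norm, codes
```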

We select one set of typical data for each gesture in the LM-Gesture3D dataset and build its feature diagrams, as shown in Figure 4. Each row in Figure 4 corresponds sequentially to one of the gestures in LM-Gesture3D, and the four plots in each row, from left to right, depict the palm posture, finger bending angle, finger opening angle, and trajectory, respectively. It is not hard to see that each feature diagram depicts nicely how its corresponding gesture is performed. A gesture with complicated changes usually corresponds to complex feature curves, and vice versa. Different gestures may have similar features: the palm posture features of G1–G3, for example, are similar to those of G6–G8, and the finger bending angles and finger opening angles of G6–G8 are similar to one another. Therefore, it is not easy to distinguish these gestures with just a single feature. Of course, some gestures have significantly different features, such as G1 and G2, so there is no misrecognition between G1 and G2.

There may exist more discriminative features that could improve the recognition rate and reduce the computational cost for a given gesture. However, considering that the eight kinds of gestures in LM-Gesture3D have obvious similarities, we prefer a complete, even somewhat redundant, feature set that meets the requirements of unified modeling and recognition of the gestures. As described for Figure 4, some features in the defined feature set are similar across different gestures, but the set also contains clearly distinct features. So, on the whole, the collected LM-Gesture3D gestures, or even more kinds of dynamic gestures, can be adequately represented and distinguished by the four defined types of features.

Among the four features, the finger bending angle and the finger opening angle are not affected by the acquisition direction. To verify whether the remaining two kinds of features are rotation invariant, we obtain the hand data of gesture G6 from an experimenter, who is asked to perform gesture G6 twice during the collection period. Then, we extract the posture feature and the trajectory feature from the collected hand data and draw the feature curves, as shown in Figure 5.

3.3. Feature Sequence Clustering

As shown in Table 2, in a single data frame of a gesture, each of the four features can be represented by a vector of fixed dimension. Accordingly, over the data frames of a dynamic gesture, each feature forms a sequence of such vectors. In order to build discrete HMMs, the K-means algorithm [25] is used to cluster the feature vectors in each sequence. After the feature vectors of a given type are clustered, each vector in the sequence is replaced by the index of its closest cluster center, so the feature vector sequence is expressed as a sequence of discrete cluster labels. The cluster numbers of the four kinds of features used in the paper are shown in Table 2.

In short, we take the discrete feature sequences composed of cluster labels as the inputs to the discrete HMMs. Therefore, both the sample data in the HMM training stage and the gesture data in the HMM recognition stage need to go through the steps of feature extraction and clustering.
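A sketch of this clustering step with scikit-learn's KMeans: the feature vectors from all training samples of one feature type are pooled to fit the cluster centers, and every gesture's vector sequence is then mapped to a sequence of cluster labels. The cluster counts per feature are taken from Table 2 and are not repeated here.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(train_vectors: np.ndarray, n_clusters: int, seed: int = 0) -> KMeans:
    """Fit the K-means codebook for one feature type.
    train_vectors: (n_frames_total, dim) feature vectors pooled from all training gestures."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(train_vectors)

def quantize_sequence(codebook: KMeans, seq: np.ndarray) -> np.ndarray:
    """Map a (T, dim) feature vector sequence to a (T,) sequence of discrete cluster labels."""
    return codebook.predict(seq)
```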

4. Gesture Modeling and Recognition

4.1. Recognizing Flow

The gesture recognition process is shown in Figure 6 and can be divided into two parts. The first part deals with accurate gesture segmentation and the extraction and quantization of the four features. The second part includes HMM training and gesture recognition, both of which rely on the extracted feature sequences.

An HMM can be formally expressed as a 5-tuple (S, V, A, B, π), where S is the finite set of Markov chain states and N is the number of states; V is the finite set of observation symbols and M is the number of symbols; A is the state transition probability matrix; B is the observation probability matrix; and π is the initial state probability distribution.

4.2. HMM Training

Unlike the common pattern of one HMM per kind of gesture, we build one HMM for each feature, which means that four HMMs are used jointly to recognize each performed unknown gesture. Taking LM-Gesture3D as an example, the eight designed gestures are denoted G1–G8; then, for each feature sequence of each gesture, the following HMM modeling process is carried out:

(1) HMM initialization: according to Table 1, N is set to 6 in the paper. The number of observation symbols M is set to the number of cluster centers shown in Table 2, and the initial model parameters are denoted λ = (A, B, π).

(2) HMM parameter reestimation: the training data of a feature consist of multiple observation sequences, each of which is a sequence of discrete cluster labels obtained as described in Section 3.3.

To compute the reestimated parameters, the observation sequences and the current model parameters λ are substituted into the standard Baum-Welch reestimation equations.

Thus, a new model is obtained. The above process is repeated until the change in the occurrence probability of the observation sequences between two adjacent iterations falls below a predefined convergence threshold, where the occurrence probability P(O|λ) is calculated with the forward-backward algorithm.

The final model parameters are the optimal parameters for the corresponding feature sequence, that is, the single-feature HMM of its corresponding gesture. By repeating the above modeling process for each of the four feature sequences of the 8 dynamic gestures, we obtain 32 single-feature HMMs in total.
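The sketch below illustrates the modeling stage under the stated settings (N = 6 hidden states, discrete cluster labels as observations, Baum-Welch training): one discrete HMM is fitted per (gesture, feature) pair, yielding 32 models for LM-Gesture3D. It uses hmmlearn's CategoricalHMM as a stand-in for the paper's own Baum-Welch implementation; the class name and its fit interface are assumptions about that library (older hmmlearn releases expose the same functionality as MultinomialHMM), and the number of symbols M is inferred from the training data.

```python
import numpy as np
from hmmlearn.hmm import CategoricalHMM   # discrete-observation HMM trained with Baum-Welch

N_STATES = 6   # number of hidden states N (Section 4.2)

def train_feature_hmm(label_sequences, seed: int = 0) -> CategoricalHMM:
    """Train one single-feature HMM from the discrete label sequences of one gesture.
    label_sequences: list of 1-D integer arrays of cluster labels."""
    X = np.concatenate(label_sequences).reshape(-1, 1)   # stacked observations
    lengths = [len(s) for s in label_sequences]          # per-sequence lengths
    model = CategoricalHMM(n_components=N_STATES, n_iter=200, tol=1e-4,
                           random_state=seed)
    model.fit(X, lengths)                                # Baum-Welch reestimation
    return model

def train_all_models(dataset):
    """dataset[g][f] = list of label sequences for gesture g and feature f.
    Returns models[g][f]: 8 gestures x 4 features = 32 single-feature HMMs."""
    return [[train_feature_hmm(feature_seqs) for feature_seqs in gesture_feats]
            for gesture_feats in dataset]
```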

4.3. Gesture Recognition with HMM Fusion

In the gesture recognition stage, once the original data of an unknown, valid dynamic gesture are obtained, they are first converted into 4 observation sequences, one per feature. Then, the forward-backward algorithm is used to calculate the occurrence probability of each observation sequence under each of the 8 corresponding single-feature HMMs. In this way, one occurrence probability is obtained for every combination of gesture class and feature.

We present a weighted probability fusion algorithm to compute the probability that an unknown gesture belongs to gesture Gi in LM-Gesture3D, as given in equation (10): the fused probability of gesture Gi is a weighted combination of the occurrence probabilities of its four features, where each weight corresponds to one feature of gesture Gi.

According to equation (10), 8 fused probabilities are calculated, and the gesture with the maximum value is regarded as the recognition result of the unknown gesture.
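A sketch of the fusion step, reusing the models from the training sketch above. It assumes that equation (10) is a weighted sum of the four per-feature occurrence probabilities; the use of hmmlearn's score (which returns a log-likelihood, length-normalized and exponentiated here purely to avoid numerical underflow) is likewise an implementation assumption rather than the paper's procedure.

```python
import numpy as np

def feature_probabilities(models_for_gesture, obs_seqs):
    """Occurrence probability of each of the four observation sequences under the
    four single-feature HMMs of one gesture class."""
    probs = []
    for model, seq in zip(models_for_gesture, obs_seqs):
        loglik = model.score(np.asarray(seq).reshape(-1, 1))
        probs.append(np.exp(loglik / len(seq)))   # length-normalized probability (numerical choice)
    return np.asarray(probs)                      # shape (4,)

def recognize(models, weights, obs_seqs):
    """models[g][f]: trained single-feature HMMs; weights[g]: (4,) fusion weights for gesture g;
    obs_seqs: the 4 discrete observation sequences of the unknown gesture.
    Returns the index of the gesture with the largest fused probability (assumed equation (10))."""
    fused = [np.dot(weights[g], feature_probabilities(models[g], obs_seqs))
             for g in range(len(models))]
    return int(np.argmax(fused))
```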

The paper employs the least squares method (LSM) to determine the weights in equation (10). A brief introduction of the LSM weighting procedure follows. Firstly, we calculate the occurrence probabilities of the four features for all samples in the training dataset. Secondly, for gesture Gi in LM-Gesture3D, if a sample belongs to Gi, the target probability of the sample corresponding to gesture Gi is set to one predefined value.

Otherwise, the target probability of the sample corresponding to gesture Gi is set to a lower value determined by a preset probability.

By collecting the feature probabilities and target probabilities of all samples corresponding to gesture Gi in this way, we obtain the system in equation (15), which relates the feature probabilities of the samples to their target probabilities through the weights.

We then use the least squares method to solve equation (15) for the weights. Finally, the computed weights are normalized, and the normalized result is taken as the weight vector of the fusion model.
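An illustrative realization of this weight computation: for each gesture class, the per-feature probabilities of all training samples are stacked into a matrix, a target vector is built (a high value for samples of that gesture, a low value otherwise), and the weights are obtained by ordinary least squares and then normalized. The target values 1.0 and 0.0, the non-negativity clipping, and the sum-to-one normalization are assumptions standing in for the paper's preset probabilities and normalization.

```python
import numpy as np

def lsm_weights(feature_probs: np.ndarray, labels: np.ndarray, gesture_id: int,
                p_pos: float = 1.0, p_neg: float = 0.0) -> np.ndarray:
    """feature_probs: (K, 4) per-feature occurrence probabilities of the K training samples
    under the HMMs of `gesture_id`; labels: (K,) true gesture indices of those samples.
    p_pos / p_neg are the assumed target probabilities for positive / negative samples."""
    targets = np.where(labels == gesture_id, p_pos, p_neg)          # target vector
    w, *_ = np.linalg.lstsq(feature_probs, targets, rcond=None)     # least-squares fit
    w = np.clip(w, 0.0, None)                                       # practical safeguard: non-negative weights
    return w / (w.sum() + 1e-12)                                    # normalize to sum to one
```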

5. Experiments

To test the performance of the proposed method, several experiments are carried out on a desktop PC with an Intel Core i5-3230M processor and 4 GB of RAM; the software environment consists of Visual Studio 2013, Leap Motion SDK 2.3.1 + 3154, and MATLAB 2012a.

5.1. LM-Gesture3D Recognition Experiment

We select four participants with experience in robot teleoperation for the experiment. Each participant is asked to perform each gesture in LM-Gesture3D 40 times, and LM samples the gestures, so there are 160 samples of each gesture.

To verify the feasibility of the proposed method, we define the recognition rate as R = Nc/Nt, where Nc is the number of gestures correctly recognized and Nt is the total number of gestures recognized.

Firstly, we use K-fold cross-validation to evaluate the recognition performance and stability of the proposed method. In this experiment, K is set to 10, so each subset has 128 samples. Figure 7 shows the results of the K-fold cross-validation: the recognition rates of the different trained models range from 89.8% to 92.9%. The fluctuation range of the recognition rates of the 10 trained HMM models is about 3%, which shows that the proposed method has good generalization ability. The average recognition rate of the 10 trained HMM models is about 90.8%, which indicates that the proposed method has good recognition performance.

Furthermore, we analyze the recognition performance of the proposed method for the different types of gestures in LM-Gesture3D. We randomly select 60 samples of each gesture as the testing set and take the remaining samples as the training set. Table 3 shows the recognition results. From the table, we can see that our method represents the 8 dynamic gestures well, with an average recognition rate of about 90.6%. The recognition rates for all gestures fluctuate only slightly, between 88.3% and 91.7%. The recognition rates of G4 and G6–G8 are higher than those of G1–G3 and G5, because the former gestures are relatively simpler and easier for different users to repeat, whereas G1–G3 and G5 are easily influenced by a participant's individual habits. In addition, gestures G1–G3 are easily confused with G6–G8, respectively.

In general, the recognition results are jointly determined by four kinds of features, and our method based on multiple features and HMM can represent most kinds of complex gestures, which proves that our method is effective.

5.2. Dynamic Gesture Recognition Experiments

This experiment mainly tests the recognition rate of our method on two kinds of relatively simple dynamic gestures, collected in the letter-gesture dataset and the waving-gesture dataset, respectively. As shown in Figure 8(a), the letter-gesture set consists of 6 gestures numbered 1 to 6, which are similar to each other. The waving-gesture dataset contains the remaining 6 gestures shown in Figure 8(b). It can be seen that the dominant features of the two gesture sets are the trajectory feature and the palm posture feature, respectively.

The gestures in the experiment are sampled from four participants, and each participant is asked to repeat each gesture 50 times. When collecting the letter-gesture dataset, each participant keeps the hand shape as unchanged as possible and the palm parallel to the horizontal plane of LM. The obtained data of each gesture are further divided into 120 sets of training samples and 80 sets of testing samples.

Chen et al. [17] propose a rapid early recognition system based on SVM to achieve multiclassification among the 36 dynamic gestures (the 3D motion trajectory of the numbers and the alphabet). Chen’s method uses LM to capture 3D motion trajectories of the gestures, which is the same as our method. In Chen’s method, the orientation angle is utilized as a unique feature of the gesture trajectory projected into the XOZ plane. It is quantized by dividing it by 45° and coded from 1 to 9, which is similar to our method. Chen’s method is also used to recognize the gestures in the letter-gesture dataset.

Figure 9 shows the recognition results of our method and Chen's method, which achieve average recognition rates of 96.0% and 93.5%, respectively. The two approaches have very similar recognition rates. However, the fluctuation of the recognition rate of our method with LSM weights is smaller than that of Chen's approach, which shows that our method has better recognition stability.

In addition, the directional code extracted by Chen's method is determined by two neighboring points on the trajectory, whereas that of our method is determined by the trajectory points and the central point; we also introduce a distance feature. Therefore, the trajectory feature extracted by our method is not affected by the amplitude of the gesture and is rotation invariant.

Based on the above analysis, we believe that our method performs better than Chen’s method.

The waving directions of gestures 7–10 in the waving-gesture dataset are from upper right to lower left, from upper left to lower right, from top to bottom, and from bottom to top, respectively, and gestures 11 and 12 are roughly 90° clockwise and counterclockwise rotations, respectively. This kind of dynamic gesture can be distinguished easily using palm posture features. We carry out an experiment to test the recognition performance of our method on these 6 kinds of gestures. In the experiment, the data acquisition and processing are the same as in the experiment on the letter-gesture dataset.

Pan et al. [26] present a combination method based on rule-based classification and SVM to recognize gestures, which also uses LM to capture real-time frame data of hand motion and defines a 14-dimensional feature set including the absolute pose of the hand in the 3D coordinate system and the pose changes of the hand between two frames. Pan's method is also used to recognize the gestures in the waving-gesture dataset.

Figure 10 shows the recognition results of our method and Pan's method. The recognition rates of the two methods for gestures 7–12 are all over 90%, and the average recognition rates are 90.4% and 90.8%, respectively. The average recognition rate of Pan's method is slightly higher than that of our method.

Compared with our method, Pan's method incurs higher computational costs because it selects high-dimensional features and adopts a two-step recognition strategy. Our method not only has a high recognition rate but is also rotation invariant, since it takes the rotation angles relative to the initial hand posture as features. It therefore works well on waving or rotation gestures, such as those in the waving-gesture dataset.

In addition, all three methods above use LM to sample the gestures, and the data for the features they define can be obtained quickly and accurately by LM. With a camera-based approach, by contrast, we would have to depend on hand-region features to recognize the gestures, which is more complex and challenging. Hence, we can conclude that LM brings excellent benefits to our research.

5.3. Generalization Experiment

A generalization experiment is carried out to verify the adaptability of our method to nonstandard gestures. We select four inexperienced participants for the experiment, each of whom is asked to repeat each gesture from LM-Gesture3D 40 times. A total of 1280 gestures are sampled and recognized by the trained HMM models with the same weights as in the LM-Gesture3D recognition experiment. The average recognition rate of 90.5% shown in Figure 11 is very similar to that of the LM-Gesture3D recognition experiment. So, the method is adaptable to different nonstandard gestures and has good generalization ability.

We define the positive predictive value (PPV) and accuracy (ACC) of gesture Gi (i = 1, 2, ..., 8) as

PPV_i = TP_i / (TP_i + FP_i),

where TP_i is the number of samples of gesture Gi that are correctly recognized and FP_i is the number of samples of the other seven gestures that are incorrectly recognized as Gi, and

ACC_i = TP_i / (TP_i + Σ_{j≠i} E_ij),

where E_ij is the number of samples of gesture Gi that are incorrectly recognized as gesture Gj.
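Both quantities can be read directly off a confusion matrix such as Table 4; a small helper is sketched below (the orientation of the matrix, rows = true gesture and columns = predicted gesture, is an assumption).

```python
import numpy as np

def ppv_and_acc(conf: np.ndarray):
    """conf[i, j]: number of samples of gesture Gi recognized as gesture Gj
    (rows = true class, columns = predicted class).
    Returns the per-gesture PPV and ACC defined above."""
    tp = np.diag(conf).astype(float)
    ppv = tp / conf.sum(axis=0)     # TP / (TP + other gestures predicted as Gi)
    acc = tp / conf.sum(axis=1)     # TP / (TP + samples of Gi predicted as other gestures)
    return ppv, acc
```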

Table 4 shows the confusion matrix of the generalization experiment using the proposed method. According to Table 4, except for G5, whose PPV is about 0.96, the PPVs of the other seven gestures differ little, ranging from 0.89 to 0.91.

5.4. Comparison Experiment with Other HMM-Based Methods

Here, we compare the recognition performance of the proposed method with other recognition methods based on HMM.

The authors in [12] define three features, including the hand shape, palm trajectory, and distance from the camera, to extract the hand model from image features, and they propose a combinatorial method based on HMM and BPNN. The HMM-BPNN method uses the classical HMM to model the dynamic gesture features and then uses the BP neural network to classify the resulting state sequences.

In this experiment, the samples are from the LM-Gesture3D recognition experiment in Section 5.1: 60 randomly selected samples of each gesture form the testing set, and the remaining samples form the training set. The experiment is divided into two parts: feature testing and algorithm testing.

The feature testing experiment uses the features defined in [12] to describe the gestures and analyzes the recognition rate of the HMM-BPNN method. Table 5 shows the recognition results of the experiment. From the table, we can see that the HMM-BPNN method has an average recognition rate of only about 50.83% for the 8 dynamic gestures. Moreover, its recognition rate fluctuates greatly across the different types of gestures. The main reason for the low recognition rate of the HMM-BPNN method on the gestures in LM-Gesture3D is that the three types of 2D features defined by that method are only suitable for representing simple and highly differentiated gestures; they cannot fully represent complex and highly similar gestures, such as G5.

The algorithm testing experiment uses the features defined by our method to describe the gestures and analyzes the recognition rate of the HMM-BPNN method again. Table 6 shows the recognition results of the experiment.

From Table 6, we can see that the HMM-BPNN method has an average recognition rate of about 80.83% for the 8 dynamic gestures, which is about 30 percentage points higher than in the feature testing experiment. Moreover, its recognition rate fluctuates less across the different types of gestures. The results show that the features defined in this paper can represent the complex gestures in LM-Gesture3D more effectively than the features used by the HMM-BPNN method.

For the same gesture samples and the same defined features, the recognition rate of our method, shown in Table 3, is more than 90%, which is about 10 percentage points higher than that of the HMM-BPNN method. We think there are two main reasons for the relatively low recognition rate of the HMM-BPNN method. Firstly, the input of the BPNN classifier is decided by a maximum assessment of the probabilities of the trained HMMs of the four types of features, which does not consider the interference between similar features. Secondly, the BP neural network is prone to fall into local minima, which increases the risk of misrecognition when different sample features are highly similar.

6. Conclusion

In the paper, a fusion recognition method based on multiple features and HMM is proposed for dynamic gestures. We consider both the change in hand shape and the moving trajectory, and we build four sorts of hand features that are straightforward, simple, and rotation invariant, which bring better naturalness and flexibility of operation for operators. Moreover, these features allow further extension to more kinds of complex dynamic gestures. For each feature, we build its corresponding HMM. In the recognition stage, we present a weighted fusion algorithm that combines the occurrence probabilities to obtain the final recognition result, so the result is not easily dominated by a particular feature.

The experimental results show that the proposed method is not only suitable for relatively simple dynamic gestures like letter gestures and waving gestures but also robust for complex dynamic gestures like those in LM-Gesture3D. The average recognition rate of the proposed method on LM-Gesture3D reaches 90.6%. Besides, the average recognition rate for inexperienced participants is about 90%. These results demonstrate the usability and feasibility of the proposed method.

Like other gesture recognition methods, the proposed method inevitably has certain limitations, and a more in-depth study needs to be carried out. Firstly, as four HMMs are evaluated for each gesture, the efficiency of the algorithm remains to be improved. Secondly, we have not yet studied adaptive weighting methods and their impact on the recognition rate, which will also be a future research direction.

Data Availability

The research repository related to this paper will be established on GitHub (https://github.com/glchenwhut), where the experimental data and related files can be accessed.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Authors’ Contributions

Guoliang Chen conceived the idea, designed the experiments, and wrote the paper. Kaikai Ge helped with the algorithm and analyzed the experimental data.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant no. 61672396.