Dynamic gesture recognition based on feature fusion network and variant ConvLSTM
IET Image Processing (IF 2.0). Pub Date: 2020-09-07. DOI: 10.1049/iet-ipr.2019.1248
Yuqing Peng 1, 2 , Huifang Tao 1, 2 , Wei Li 1, 2 , Hongtao Yuan 1, 2 , Tiejun Li 3

Gesture is a natural form of human communication, and it is of great significance in human–computer interaction. In dynamic gesture recognition methods based on deep learning, the key is to obtain comprehensive gesture feature information. To address the inadequate extraction of spatiotemporal features and the loss of feature information in current dynamic gesture recognition, a new gesture recognition architecture is proposed that combines a feature fusion network with a variant convolutional long short-term memory (ConvLSTM). The architecture extracts spatiotemporal feature information at the local, global and deep levels, and uses feature fusion to alleviate the loss of feature information. First, local spatiotemporal feature information is extracted from the video sequence by a 3D residual network based on channel feature fusion. Then the authors use the variant ConvLSTM to learn the global spatiotemporal information of the dynamic gesture, introducing an attention mechanism that modifies the gate structure of the ConvLSTM. Finally, a multi-feature fusion depthwise separable network is used to learn higher-level features, including deep feature information. The proposed approach obtains very competitive performance on the Jester dataset, with a classification accuracy of 95.59%, and achieves state-of-the-art performance on the SKIG (Sheffield Kinect Gesture) dataset with 99.65% accuracy.
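The third stage relies on depthwise separable convolutions, whose appeal is a much smaller parameter budget than standard convolutions. A minimal sketch of the parameter-count comparison (the channel sizes and kernel size below are illustrative assumptions, not taken from the paper):

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution across channels."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

# Illustrative layer: 64 input channels, 128 output channels, 3x3 kernel.
standard = conv_params(64, 128, 3)                    # 73728 weights
separable = depthwise_separable_params(64, 128, 3)    # 8768 weights
print(standard, separable, round(standard / separable, 1))
```

For this layer the separable factorisation needs roughly 8x fewer weights, which is why it is a common choice for learning higher-level features cheaply.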
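The variant ConvLSTM builds on the standard ConvLSTM cell update. A minimal elementwise sketch of that baseline update is shown below; the learned convolutions over the input and previous hidden state are abstracted into precomputed gate pre-activations, and the paper's attention-based modification of the gates is not reproduced here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def convlstm_step(gate_preactivations, c):
    """One standard ConvLSTM cell update.

    `gate_preactivations` stands in for the result of the learned
    convolutions over the current input and previous hidden state,
    already split into the four gate pre-activation maps.
    """
    i_pre, f_pre, o_pre, g_pre = gate_preactivations
    i = sigmoid(i_pre)            # input gate
    f = sigmoid(f_pre)            # forget gate
    o = sigmoid(o_pre)            # output gate
    g = np.tanh(g_pre)            # candidate cell state
    c_next = f * c + i * g        # elementwise over the feature maps
    h_next = o * np.tanh(c_next)
    return h_next, c_next

# Toy 4x4 single-channel feature maps with zero pre-activations.
zeros = np.zeros((4, 4))
h_next, c_next = convlstm_step((zeros, zeros, zeros, zeros), np.ones((4, 4)))
```

With zero pre-activations every gate sigmoid evaluates to 0.5, so the cell state is simply halved; the authors' variant instead reweights these gate maps with attention before the update.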

Updated: 2020-09-08