Multi modal spatio temporal co-trained CNNs with single modal testing on RGB–D based sign language gesture recognition
Journal of Computer Languages (IF 1.7), Pub Date: 2019-04-22, DOI: 10.1016/j.cola.2019.04.002
Sunitha Ravi, Maloji Suman, P.V.V. Kishore, Kiran Kumar E, Teja Kiran Kumar M, Anil Kumar D

Extracting hand movements from a single RGB video camera is a necessary step in developing an automated sign language recognition system. Local spatio-temporal methods have shown encouraging results for hand extraction using color cues. However, color intensities do not behave as an independent entity during video capture in real environments, which has become a roadblock in developing a sign language machine translator that processes video data in real-world conditions. Not surprisingly, recognition is more accurate when additional information is provided in the form of depth. In this paper, we use a multi-modal feature-sharing mechanism with a four-stream convolutional neural network (CNN) for RGB-D based sign language recognition. Unlike multi-stream CNNs, where the output class prediction is based on two or three modal streams operated independently due to scale variations, we propose a feature-sharing multi-stream CNN on multi-modal data for sign language recognition. The proposed four-stream CNN divides its inputs into two groups across the training and testing spaces. The training space uses four inputs: RGB spatial data in the main stream, and depth spatial, RGB temporal, and depth temporal data in the Region of Interest Mapping (ROIM) stream. The testing space uses only RGB spatial and RGB temporal data for prediction from the trained model. The ROIM stream shares the multi-modal data to generate ROI maps of the human subject, which are used to regulate the feature maps in the RGB stream. The scale variations across the streams are managed by translating the depth map to fit the RGB data. Sharing multi-modal features with the RGB spatial features during training circumvents overfitting on the RGB video data. To validate the proposed CNN architecture, the accuracy of the classifier is investigated on RGB-D sign language data and three benchmark action datasets. The results show that the classifier handles the missing depth modality remarkably well during testing. The robustness of the system against state-of-the-art action recognition methods is studied using contrasting datasets.
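To make the feature-sharing idea concrete, the following is a minimal PyTorch sketch of how an ROIM stream could gate the RGB main stream during training while the test-time path runs on RGB alone. All module names, channel counts, layer sizes, and the sigmoid gating scheme are illustrative assumptions, not the authors' published implementation; the depth inputs are assumed to be already translated and resized to the RGB frame.

```python
# Hypothetical sketch of multi-modal feature sharing with single-modal testing.
# Module names, channel counts, and the gating mechanism are assumptions.
import torch
import torch.nn as nn


class StreamEncoder(nn.Module):
    """Small convolutional encoder; the same structure is reused per stream."""

    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.net(x)


class FourStreamSignNet(nn.Module):
    """RGB main stream regulated by an ROI map built from the ROIM stream."""

    def __init__(self, num_classes):
        super().__init__()
        self.rgb_spatial = StreamEncoder(3)   # main stream: RGB spatial frames
        # ROIM stream input: depth spatial (1ch) + RGB temporal (3ch) + depth temporal (1ch)
        self.roim = StreamEncoder(5)
        self.roi_head = nn.Conv2d(64, 1, 1)   # 1-channel ROI map, squashed to [0, 1]
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes)
        )

    def forward(self, rgb, depth=None, rgb_flow=None, depth_flow=None):
        feat = self.rgb_spatial(rgb)
        if depth is not None:
            # Training: all modalities available; build ROI map and gate RGB features.
            roim_in = torch.cat([depth, rgb_flow, depth_flow], dim=1)
            roi = torch.sigmoid(self.roi_head(self.roim(roim_in)))
            feat = feat * roi                  # regulate RGB feature maps with ROI map
        # Testing: depth is absent, so the plain RGB path produces the prediction.
        return self.classifier(feat)


if __name__ == "__main__":
    model = FourStreamSignNet(num_classes=50)
    rgb = torch.randn(2, 3, 64, 64)
    # Training-time forward pass with all four inputs.
    logits = model(rgb,
                   depth=torch.randn(2, 1, 64, 64),
                   rgb_flow=torch.randn(2, 3, 64, 64),
                   depth_flow=torch.randn(2, 1, 64, 64))
    # Test-time forward pass with RGB only (missing depth modality).
    logits_test = model(rgb)
    print(logits.shape, logits_test.shape)
```

In this simplified sketch the test-time path uses only the RGB spatial input; the paper additionally feeds RGB temporal data at test time, which could be handled by an extra RGB-only branch in the same spirit.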




Updated: 2019-04-22