Multi-view motion modelled deep attention networks (M2DA-Net) for video based sign language recognition
Journal of Visual Communication and Image Representation ( IF 2.6 ) Pub Date : 2021-05-19 , DOI: 10.1016/j.jvcir.2021.103161
Suneetha M. , Prasad M.V.D. , Kishore P.V.V.

Video-based sign language recognition (SLR) has been extensively studied using deep learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Combining a multi-view attention mechanism with CNNs is an appealing way to make the machine interpretation process robust to finger self-occlusions. The proposed multi-stream CNN mixes spatial and motion-modelled video sequences to create a low-dimensional feature vector at multiple stages of the CNN pipeline. We thereby recast the view-invariance problem as a video classification problem solved with attention-model CNNs. For superior network performance during training, signs are learned through a motion attention network that focuses on the parts playing a major role in view-based paired pooling, carried out by a trainable view pair pooling network (VPPN). The VPPN pairs views to produce maximally distributed discriminative features from all views for improved sign recognition. The results show increased recognition accuracies on 2D video sign language datasets. Because no multi-view sign language dataset exists other than ours, similar results were also demonstrated on benchmark action datasets such as NTU RGB-D, MuHAVi, WEIZMANN and NUMA.
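The core idea of pooling features across pairs of camera views can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's trainable VPPN: the function name, the element-wise max over each pair, and the average over pairs are all hypothetical stand-ins for the learned pairing described in the abstract.

```python
import numpy as np

def view_pair_pooling(view_feats):
    """Toy stand-in for view-pair pooling (hypothetical, not the paper's VPPN):
    for every unordered pair of view feature vectors, take an element-wise max
    so a feature visible in either view survives, then average over all pairs
    to get one fused descriptor."""
    n = len(view_feats)
    pooled = []
    for i in range(n):
        for j in range(i + 1, n):
            # A feature occluded in one view may still fire in the other.
            pooled.append(np.maximum(view_feats[i], view_feats[j]))
    return np.mean(pooled, axis=0)

# Three views, each contributing a 4-dim feature vector.
views = [
    np.array([1.0, 0.0, 0.0, 0.0]),
    np.array([0.0, 1.0, 0.0, 0.0]),
    np.array([0.0, 0.0, 1.0, 0.0]),
]
fused = view_pair_pooling(views)  # one descriptor spanning all view pairs
```

In the actual network the pairing is trainable and operates on CNN feature maps; the sketch only shows why pairing views helps, since a feature self-occluded in one view can be recovered from its partner.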




Updated: 2021-05-22