Transferring Cross-domain Knowledge for Video Sign Language Recognition,arXiv - CS - Multimedia

Transferring Cross-domain Knowledge for Video Sign Language Recognition
arXiv - CS - Multimedia Pub Date : 2020-03-08 , DOI: arxiv-2003.03703
Dongxu Li, Xin Yu, Chenchen Xu, Lars Petersson, Hongdong Li

Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation. It requires models to recognize isolated sign words from videos. However, annotating WSLR data needs expert knowledge, thus limiting WSLR dataset acquisition. In contrast, there are abundant subtitled sign news videos on the internet. Since these videos have no word-level annotation and exhibit a large domain gap from isolated signs, they cannot be directly used for training WSLR models. We observe that despite the existence of a large domain gap, isolated and news signs share the same visual concepts, such as hand gestures and body movements. Motivated by this observation, we propose a novel method that learns domain-invariant visual concepts and enriches WSLR models by transferring knowledge of subtitled news signs to them. To this end, we extract news signs using a base WSLR model, and then design a classifier jointly trained on news and isolated signs to coarsely align these two domain features. In order to learn domain-invariant features within each class and suppress domain-specific features, our method further resorts to an external memory to store the class centroids of the aligned news signs. We then design a temporal attention based on the learnt descriptor to improve recognition performance. Experimental results on standard WSLR datasets show that our method outperforms previous state-of-the-art methods significantly. We also demonstrate the effectiveness of our method on automatically localizing signs from sign news, achieving 28.1 for AP@0.5.
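The abstract mentions two mechanisms that a small example can make concrete: a memory of per-class centroids and a temporal attention driven by a learnt descriptor. The following is a minimal sketch of those two ideas only, not the authors' implementation. The function names, the toy two-dimensional features, the dot-product scoring, and the softmax pooling are all assumptions made for illustration; the abstract does not specify how the memory or the attention is actually built.

```python
import math


def class_centroids(features, labels):
    """Average the feature vectors of each class (a toy stand-in for the memory)."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def temporal_attention(frames, descriptor):
    """Pool per-frame features with softmax weights from dot-product scores.

    Assumption: frames are scored by their dot product with the descriptor.
    """
    scores = [sum(f * d for f, d in zip(frame, descriptor)) for frame in frames]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * frame[i] for w, frame in zip(weights, frames))
        for i in range(len(descriptor))
    ]
    return weights, pooled


# Hypothetical toy data: three news-sign features with word labels.
news_features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
news_labels = ["book", "book", "drink"]
memory = class_centroids(news_features, news_labels)

# Three frames of an isolated-sign video; the middle frame resembles "book".
frames = [[0.0, 1.0], [2.0, 0.0], [0.0, 1.0]]
weights, pooled = temporal_attention(frames, memory["book"])
```

In this toy setup the middle frame receives the largest attention weight because it aligns best with the stored "book" centroid, which is the intuition behind letting a class descriptor decide which frames matter.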

Updated: 2020-03-18
