An Attention Self-supervised Contrastive Learning based Three-stage Model for Hand Shape Feature Representation in Cued Speech


arXiv - CS - Multimedia, Pub Date: 2021-06-26, DOI: arxiv-2106.14016
Jianrong Wang, Nan Gu, Mei Yu, Xuewei Li, Qiang Fang, Li Liu

Cued Speech (CS) is a communication system for deaf or hearing-impaired people in which a speaker supports lipreading at the phonetic level by using hand shapes and hand positions to disambiguate mouth movements. Feature extraction from multi-modal CS data is a key step in CS recognition. Recent supervised deep-learning methods suffer from noisy CS annotations, especially for the hand shape modality. In this work, we first propose a self-supervised contrastive learning method that learns image feature representations without labels. Second, a small amount of manually annotated CS data is used to fine-tune the first module. Third, we present a module that combines Bi-LSTM and self-attention networks to further learn sequential features with temporal and contextual information. In addition, to enlarge the volume and diversity of the currently limited CS datasets, we build a new British English dataset containing 5 native CS speakers. Evaluation results on both French and British English datasets show that our model achieves over 90% accuracy in hand shape recognition. Compared with the state of the art, CS phoneme recognition correctness improves significantly, by 8.75% (French) and 10.09% (British English).
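
The abstract does not say which contrastive objective the first stage uses. As an illustration only, the sketch below assumes an NT-Xent style loss (the normalized temperature-scaled cross-entropy commonly used in self-supervised contrastive learning), in which two augmented views of the same hand image are pulled together and all other images in the batch are pushed apart. The function names `cosine` and `nt_xent_loss`, the temperature value, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(views_a, views_b, temperature=0.5):
    """NT-Xent loss over a batch of paired embeddings (assumed objective).

    views_a[i] and views_b[i] are embeddings of two augmentations of the
    same image i; every other embedding in the batch acts as a negative.
    """
    n = len(views_a)
    z = list(views_a) + list(views_b)  # 2n embeddings in total
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same image
        # Similarities to every embedding except the anchor itself.
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)


# Toy 2-D embeddings: each row of `views_b` is close to the same row of `views_a`.
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[0.9, 0.1], [0.1, 0.9]]
loss = nt_xent_loss(views_a, views_b)
```

In a real pipeline the embeddings would come from an image encoder applied to augmented hand crops, and the loss would be minimized with a deep-learning framework rather than computed in plain Python. The loss is lower when paired views agree than when the pairing is shuffled, which is what lets the encoder learn without labels.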


Updated: 2021-06-29
