Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition

arXiv - CS - Multimedia Pub Date : 2021-06-25 , DOI: arxiv-2106.13686
Jianrong Wang, Ziyue Tang, Xuewei Li, Mei Yu, Qiang Fang, Li Liu

Cued Speech (CS) is a visual communication system for deaf or hearing-impaired people. It combines lip movements with hand cues to obtain a complete phonetic repertoire. Current deep-learning-based methods for automatic CS recognition share a common problem: data scarcity. To date, only two public single-speaker datasets exist, one for French (238 sentences) and one for British English (97 sentences). In this work, we propose a cross-modal knowledge distillation method with a teacher-student structure, which transfers audio speech information to CS to overcome the limited-data problem. First, we pretrain a teacher model for CS recognition on a large amount of open-source audio speech data, and simultaneously pretrain the feature extractors for lips and hands on CS data. Then, we distill the knowledge from the teacher model to the student model with frame-level and sequence-level distillation strategies. Importantly, for frame-level distillation, we exploit multi-task learning to weigh the losses automatically and so obtain the balance coefficient. In addition, we establish the first five-speaker British English CS dataset. The proposed method is evaluated on French and British English CS datasets, where it outperforms the state of the art (SOTA) in CS recognition by a large margin.
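
The abstract does not give the loss definitions, so the following is only an illustrative sketch of what frame-level distillation with an automatically weighted loss could look like, not the authors' implementation. It assumes that teacher and student produce per-frame logits over the same classes and that the frames are already aligned. It also assumes a temperature-softened KL divergence as the frame-level loss and a learnable log-variance term per loss as the automatic weighting. The names `frame_level_kd`, `weighted_total`, `log_var_task` and `log_var_kd` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one frame's logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def frame_level_kd(teacher_logits, student_logits, temperature=2.0):
    """Mean KL(teacher || student) over frames.

    Assumption: both inputs are lists of per-frame logit lists with the
    same number of frames and classes (frames already aligned).
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must have the same number of frames")
    total = 0.0
    for t_frame, s_frame in zip(teacher_logits, student_logits):
        p = softmax(t_frame, temperature)  # softened teacher posterior
        q = softmax(s_frame, temperature)  # softened student posterior
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return total / len(teacher_logits)


def weighted_total(task_loss, kd_loss, log_var_task=0.0, log_var_kd=0.0):
    """Combine two losses as exp(-s) * L + s for each loss.

    Assumption: s is a log-variance that would be trained jointly with the
    network, so the balance between the losses is learned, not hand-tuned.
    Here the values are plain floats for illustration.
    """
    return (math.exp(-log_var_task) * task_loss + log_var_task
            + math.exp(-log_var_kd) * kd_loss + log_var_kd)


if __name__ == "__main__":
    teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]  # two frames, three classes
    student = [[1.0, 0.7, -0.5], [0.0, 1.0, 0.8]]
    kd = frame_level_kd(teacher, student)
    print(weighted_total(task_loss=1.2, kd_loss=kd))
```

In a real training setup the two log-variance values would be parameters updated by the optimizer. A sequence-level distillation term would be added separately and is not sketched here.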


Updated: 2021-06-28
