ASR is all you need: cross-modal distillation for lip reading
arXiv - CS - Sound. Pub Date: 2019-11-28. DOI: arxiv-1911.12747
Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

The goal of this work is to train strong models for visual speech recognition without requiring human-annotated ground truth data. We achieve this by distilling from an Automatic Speech Recognition (ASR) model that has been trained on a large-scale audio-only corpus. We use a cross-modal distillation method that combines Connectionist Temporal Classification (CTC) with a frame-wise cross-entropy loss. Our contributions are fourfold: (i) we show that ground truth transcriptions are not necessary to train a lip reading system; (ii) we show how arbitrary amounts of unlabelled video data can be leveraged to improve performance; (iii) we demonstrate that distillation significantly speeds up training; and (iv) we obtain state-of-the-art results on the challenging LRS2 and LRS3 datasets while training only on publicly available data.
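
For concreteness, the combined objective described above can be sketched as follows. This is a minimal PyTorch-style sketch, assuming the ASR teacher supplies both decoded pseudo-transcriptions (for the CTC term) and per-frame posteriors (for the frame-wise term); the function name, tensor shapes, and the weight `alpha` are illustrative assumptions, not the authors' released code. The frame-wise cross-entropy is written here as a KL divergence, which differs from cross-entropy only by the teacher's (constant) entropy and so yields the same gradients for the student.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits,    # (T, B, C) lip-reading model outputs
                      teacher_log_probs, # (T, B, C) ASR teacher per-frame log-posteriors
                      pseudo_labels,     # (B, S) token ids decoded from the teacher
                      input_lengths,     # (B,) valid frames per clip
                      label_lengths,     # (B,) valid tokens per pseudo-transcription
                      alpha=0.5):        # assumed trade-off weight, not from the paper
    # Student log-probabilities over the output vocabulary.
    log_probs = F.log_softmax(student_logits, dim=-1)

    # CTC loss against the teacher's decoded pseudo-transcriptions;
    # no human-annotated ground truth is involved.
    ctc = F.ctc_loss(log_probs, pseudo_labels, input_lengths, label_lengths)

    # Frame-wise term: KL divergence between teacher and student per-frame
    # posteriors (cross-entropy up to a constant in the student's parameters).
    kd = F.kl_div(log_probs, teacher_log_probs, reduction="batchmean",
                  log_target=True)

    return alpha * ctc + (1.0 - alpha) * kd
```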

Updated: 2020-04-01