Unsupervised domain adaptation for lip reading based on cross-modal knowledge distillation, EURASIP Journal on Audio, Speech, and Music Processing

Unsupervised domain adaptation for lip reading based on cross-modal knowledge distillation
EURASIP Journal on Audio, Speech, and Music Processing (IF 1.7), Pub Date: 2021-12-11, DOI: 10.1186/s13636-021-00232-5
Yuki Takashima ₁ , Ryoichi Takashima ₁ , Ryota Tsunoda ₁ , Tetsuya Takiguchi ₁ , Yasuo Ariki ₁ , Ryo Aihara ₂ , Nobuaki Motoyama ₂

We present an unsupervised domain adaptation (UDA) method for lip-reading models, that is, image-based speech recognition models. Most conventional UDA methods cannot be applied when the adaptation data contains unknown classes, such as out-of-vocabulary words. In this paper, we propose a domain adaptation method based on cross-modal knowledge distillation (KD), in which the intermediate-layer output of an audio-based speech recognition model serves as the teacher for the unlabeled adaptation data. Because the audio signal carries more information for recognizing speech than lip images do, the audio-based model can act as a powerful teacher when the unlabeled adaptation data is audio-visual parallel data. In addition, because the proposed intermediate-layer-based KD expresses the teacher as a sub-class (sub-word)-level representation, the method can use data from unknown classes for adaptation. Through experiments on an image-based word recognition task, we demonstrate that the proposed approach not only improves UDA performance but can also exploit unknown-class adaptation data.
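The idea of using an audio model's intermediate-layer output as a label-free teacher can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' implementation: the student is reduced to a single linear projection of lip features, the distillation loss is taken to be mean squared error (the abstract does not state which loss is used), and the names `project`, `distillation_loss`, and `adapt_step` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def project(weights: Matrix, frame: List[float]) -> List[float]:
    """Hypothetical student layer: a linear projection of one lip-feature frame."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def distillation_loss(weights: Matrix, lip_frames: Matrix, teacher_feats: Matrix) -> float:
    """Mean squared error between student outputs and teacher features.

    teacher_feats stands in for the intermediate-layer output of an
    audio-based model on the parallel audio; no class labels are used.
    MSE is an assumed choice of distance.
    """
    total, count = 0.0, 0
    for frame, target in zip(lip_frames, teacher_feats):
        for out, t in zip(project(weights, frame), target):
            total += (out - t) ** 2
            count += 1
    return total / count


def adapt_step(weights: Matrix, lip_frames: Matrix, teacher_feats: Matrix, lr: float) -> Matrix:
    """One gradient-descent step on the MSE distillation loss."""
    count = len(lip_frames) * len(weights)  # number of squared-error terms
    grads = [[0.0] * len(row) for row in weights]
    for frame, target in zip(lip_frames, teacher_feats):
        outputs = project(weights, frame)
        for i, (out, t) in enumerate(zip(outputs, target)):
            for j, x in enumerate(frame):
                grads[i][j] += 2.0 * (out - t) * x / count
    return [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]
```

Because the target is a feature vector rather than a word label, the same update applies to adaptation utterances whose words lie outside the training vocabulary, which is the property the abstract highlights.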

Updated: 2021-12-11
