Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision


arXiv - CS - Sound. Pub Date: 2020-07-08, DOI: arxiv-2007.04134
Abhinav Shukla, Stavros Petridis, Maja Pantic

The intuitive interaction between the audio and visual modalities is valuable for cross-modal self-supervised learning. This concept has been demonstrated for generic audiovisual tasks like video action recognition and acoustic scene classification. However, self-supervision remains under-explored for audiovisual speech. We propose a method to learn self-supervised speech representations from the raw audio waveform. We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio). The visual pretext task drives the audio representations to capture information related to lip movements. This enriches the audio encoder with visual information, and the encoder can be used for evaluation without the visual modality. Our method attains competitive performance with respect to existing self-supervised audio features on established isolated word classification benchmarks, and significantly outperforms other methods at learning from fewer labels. Notably, our method also outperforms fully supervised training, thus providing a strong initialization for speech-related tasks. Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
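
As a rough illustration of the joint objective described in the abstract, the sketch below feeds one shared audio encoder into two pretext heads and sums their losses. It is not the authors' implementation: the linear layers, the mean-squared-error losses, the `visual_weight` parameter, the layer sizes, and all function and variable names are assumptions made purely for illustration, and the random vectors stand in for real waveforms, audio attributes, and face frames.

```python
import random

def linear(weights, x):
    """Apply a dense layer without bias: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def init(rows, cols, rng):
    """Small random weight matrix (hypothetical initialization)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def embed(params, waveform):
    """Audio-only path: the encoder needs no visual input at evaluation time."""
    return linear(params["encoder"], waveform)

def joint_loss(params, waveform, audio_attrs, face_pixels, visual_weight=1.0):
    """Shared embedding feeds both pretext heads; the two losses are summed."""
    z = embed(params, waveform)
    audio_loss = mse(linear(params["audio_head"], z), audio_attrs)
    visual_loss = mse(linear(params["visual_head"], z), face_pixels)
    return audio_loss + visual_weight * visual_loss, audio_loss, visual_loss

rng = random.Random(0)
params = {
    "encoder": init(8, 16, rng),     # 16 raw samples -> 8-dim embedding (assumed sizes)
    "audio_head": init(3, 8, rng),   # predicts 3 placeholder audio attributes
    "visual_head": init(4, 8, rng),  # predicts 4 placeholder face-frame pixels
}
waveform = [rng.uniform(-1, 1) for _ in range(16)]
audio_attrs = [rng.uniform(-1, 1) for _ in range(3)]
face_pixels = [rng.uniform(0, 1) for _ in range(4)]

total, a_loss, v_loss = joint_loss(params, waveform, audio_attrs, face_pixels)
print("total:", total, "audio:", a_loss, "visual:", v_loss)
```

In this toy setup both losses depend on the same encoder weights, so minimizing the visual term would also shape the embedding returned by `embed`, which is the only part needed once the visual head is discarded.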


Updated: 2020-07-14
