VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency
arXiv - CS - Sound. Pub Date: 2021-01-08, DOI: arxiv-2101.03149
Ruohan Gao, Kristen Grauman

We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers. Whereas existing methods focus on learning the alignment between the speaker's lip movements and the sounds they generate, we propose to leverage the speaker's face appearance as an additional prior to isolate the corresponding vocal qualities they are likely to produce. Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video. It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement, and generalizes well to challenging real-world videos of diverse scenarios. Our video results and code: http://vision.cs.utexas.edu/projects/VisualVoice/.
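
To make the idea concrete, below is a minimal, hypothetical sketch of how a face-appearance embedding (the cross-modal speaker prior) can condition a spectrogram-masking separator alongside per-frame lip-motion features. This is not the authors' released implementation: all module names, feature dimensions, and the concatenation-based fusion are illustrative assumptions, and the jointly trained cross-modal consistency objective is omitted.

import torch
import torch.nn as nn

class AudioVisualSeparator(nn.Module):
    """Hypothetical sketch: mask the mixture spectrogram using both
    lip motion (alignment cue) and face appearance (speaker prior)."""

    def __init__(self, freq_bins=257, lip_dim=128, face_dim=128, hidden=256):
        super().__init__()
        # Encodes the magnitude spectrogram of the audio mixture over time.
        self.audio_enc = nn.GRU(freq_bins, hidden, batch_first=True)
        # Projects per-frame lip-motion features (e.g., from a lip-reading net).
        self.lip_proj = nn.Linear(lip_dim, hidden)
        # Projects a single face-appearance embedding so it can be
        # broadcast across all time frames as a static speaker prior.
        self.face_proj = nn.Linear(face_dim, hidden)
        # Predicts a sigmoid mask over the mixture spectrogram.
        self.mask_head = nn.Linear(hidden * 3, freq_bins)

    def forward(self, mix_spec, lip_feats, face_emb):
        # mix_spec: (B, T, F), lip_feats: (B, T, lip_dim), face_emb: (B, face_dim)
        a, _ = self.audio_enc(mix_spec)                # (B, T, H)
        v = self.lip_proj(lip_feats)                   # (B, T, H)
        f = self.face_proj(face_emb).unsqueeze(1)      # (B, 1, H)
        f = f.expand(-1, a.size(1), -1)                # broadcast over time
        fused = torch.cat([a, v, f], dim=-1)           # (B, T, 3H)
        mask = torch.sigmoid(self.mask_head(fused))    # (B, T, F)
        return mask * mix_spec                         # separated spectrogram

# Toy usage with random tensors.
model = AudioVisualSeparator()
mix = torch.rand(2, 100, 257)    # batch of 2 mixtures, 100 frames, 257 freq bins
lips = torch.rand(2, 100, 128)   # per-frame lip-motion features
face = torch.rand(2, 128)        # one appearance embedding per target speaker
sep = model(mix, lips, face)
print(sep.shape)                 # torch.Size([2, 100, 257])

The key design point the abstract highlights is the third input: unlike lip motion, the face embedding is constant over time and acts as a prior on the vocal qualities (e.g., pitch range, timbre) the target speaker is likely to produce.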

Updated: 2021-01-11