You Said That?: Synthesising Talking Faces from Audio
International Journal of Computer Vision (IF 19.5), Pub Date: 2019-02-13, DOI: 10.1007/s11263-019-01150-y
Amir Jamaludin, Joon Son Chung, Andrew Zisserman

We describe a method for generating a video of a talking face. The method takes still images of the target face and an audio speech segment as inputs, and generates a video of the target face lip-synched with the audio. The method runs in real time and is applicable to faces and audio not seen at training time. To achieve this we develop an encoder–decoder convolutional neural network (CNN) model that uses a joint embedding of the face and audio to generate synthesised talking face video frames. The model is trained on unlabelled videos using cross-modal self-supervision. We also propose methods to re-dub videos by visually blending the generated face into the source video frame using a multi-stream CNN model.
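The abstract describes the architecture only at a high level. As a concrete illustration, below is a minimal PyTorch sketch of an encoder-decoder of this general shape: one CNN encodes a short window of audio features (MFCCs are assumed here), a second CNN encodes a still image of the target face into an identity embedding, and a deconvolutional decoder maps the concatenated joint embedding to a synthesised frame. All module names, layer sizes, input shapes, and the L1 reconstruction loss are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Encodes a short window of audio features (assumed MFCCs) into a vector."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, emb_dim)

    def forward(self, mfcc):                 # (B, 1, 12, 35) MFCC window
        return self.fc(self.net(mfcc).flatten(1))

class IdentityEncoder(nn.Module):
    """Encodes a still image of the target face into an identity vector."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, emb_dim)

    def forward(self, face):                 # (B, 3, 112, 112) still image
        return self.fc(self.net(face).flatten(1))

class FrameDecoder(nn.Module):
    """Decodes the joint (audio, identity) embedding into one video frame."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.fc = nn.Linear(2 * emb_dim, 128 * 7 * 7)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=4), nn.Sigmoid(),
        )

    def forward(self, audio_emb, id_emb):
        joint = torch.cat([audio_emb, id_emb], dim=1)   # joint embedding
        h = self.fc(joint).view(-1, 128, 7, 7)
        return self.net(h)                   # (B, 3, 112, 112) synthesised frame

# Cross-modal self-supervision: the video frame that co-occurs with the
# audio window serves as the regression target, so no labels are needed.
audio_enc, id_enc, dec = AudioEncoder(), IdentityEncoder(), FrameDecoder()
mfcc   = torch.randn(8, 1, 12, 35)           # batch of audio windows
still  = torch.rand(8, 3, 112, 112)          # still images of the target face
target = torch.rand(8, 3, 112, 112)          # frames aligned with the audio
pred = dec(audio_enc(mfcc), id_enc(still))
loss = nn.functional.l1_loss(pred, target)   # assumed pixel-level L1 loss
```

Under this reading, the still image is encoded once per identity while the audio encoder slides over the speech segment window by window, which is consistent with per-frame synthesis running in real time.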
