Everybody's Talkin': Let Me Talk as You Want
arXiv - CS - Multimedia. Pub Date: 2020-01-15, DOI: arxiv-2001.05201
Linsen Song, Wayne Wu, Chen Qian, Ran He, Chen Change Loy

We present a method to edit target portrait footage by taking a sequence of audio as input and synthesizing a photo-realistic video. This method is unique because it is highly dynamic: it does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output. Instead of learning a highly heterogeneous and nonlinear mapping from audio to video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, thereby preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio.
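As a rough illustration of the audio-driven stage described above, the sketch below pairs a recurrent audio-to-expression network with the recombination of per-frame parameters (target geometry and pose kept, expressions driven by the source audio). All module names, feature dimensions, and the choice of an LSTM are assumptions for illustration only; the abstract does not specify the actual architecture or parameterization.

# Minimal PyTorch sketch, assuming DeepSpeech-style audio features and
# 3DMM-style face parameters. Names and dimensions are hypothetical.
import torch
import torch.nn as nn


class AudioToExpression(nn.Module):
    """Recurrent network mapping per-frame audio features to expression parameters."""

    def __init__(self, audio_dim=29, hidden_dim=256, expr_dim=64):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, expr_dim)

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim)
        hidden, _ = self.rnn(audio_feats)
        return self.head(hidden)  # (batch, frames, expr_dim)


def recombine_parameters(target_geometry, target_pose, predicted_expression):
    """Keep the target clip's geometry and pose; swap in audio-driven expressions."""
    return {
        "geometry": target_geometry,         # identity/shape, unchanged per frame
        "pose": target_pose,                 # head pose, unchanged per frame
        "expression": predicted_expression,  # driven by the source audio
    }


if __name__ == "__main__":
    batch, frames = 1, 100
    audio = torch.randn(batch, frames, 29)      # placeholder audio features
    geometry = torch.randn(batch, frames, 80)   # placeholder identity parameters
    pose = torch.randn(batch, frames, 6)        # placeholder rotation + translation
    expr = AudioToExpression()(audio)
    params = recombine_parameters(geometry, pose, expr)
    print({k: v.shape for k, v in params.items()})

The recombined parameters would then be passed to the paper's rendering stage (the video rendering network and dynamic programming step), which is not sketched here.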

Updated: 2020-01-16