Photorealistic Audio-driven Video Portraits.
IEEE Transactions on Visualization and Computer Graphics (IF 5.2), Pub Date: 2020-09-17, DOI: 10.1109/tvcg.2020.3023573
Xin Wen , Miao Wang , Christian Richardt , Ze-Yin Chen , Shi-Min Hu

Video portraits are common in a variety of applications, such as videoconferencing, news broadcasting, and virtual education and training. We present a novel method to synthesize photorealistic video portraits for an input portrait video, automatically driven by a person's voice. The main challenge in this task is the hallucination of plausible, photorealistic facial expressions from input speech audio. To address this challenge, we employ a parametric 3D face model represented by geometry, facial expression, illumination, etc., and learn a mapping from audio features to model parameters. The input source audio is first represented as a high-dimensional feature, which is used to predict facial expression parameters of the 3D face model. We then replace the expression parameters computed from the original target video with the predicted ones, and rerender the reenacted face. Finally, we generate a photorealistic video portrait from the reenacted synthetic face sequence via a neural face renderer. One appealing feature of our approach is its generalization capability for various input speech audio, including synthetic speech audio from text-to-speech software. Extensive experimental results show that our approach outperforms previous general-purpose audio-driven video portrait methods. This includes a user study demonstrating that our results are rated as more realistic than those of previous methods.
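The pipeline described above can be sketched in a few lines. The sketch below is a minimal stand-in, not the paper's implementation: the feature and parameter dimensions, the linear audio-to-expression mapping (the paper learns a network), and all variable names are illustrative assumptions. It only shows the key step of swapping the target frame's expression coefficients for audio-predicted ones while leaving the other face-model parameters untouched.

```python
import numpy as np

# Hypothetical dimensions; the actual feature and model sizes in the paper differ.
AUDIO_DIM = 256   # dimensionality of the high-dimensional audio feature
EXPR_DIM = 64     # number of facial-expression coefficients

rng = np.random.default_rng(0)

def audio_to_expression(audio_feat, W, b):
    """Map an audio feature vector to expression parameters.
    A linear stand-in for the learned audio-to-parameter mapping."""
    return W @ audio_feat + b

def reenact(target_params, predicted_expr):
    """Replace the target frame's expression coefficients with the
    audio-predicted ones; geometry and illumination stay unchanged."""
    new_params = dict(target_params)
    new_params["expression"] = predicted_expr
    return new_params

# Toy "learned" weights and one frame of toy data.
W = rng.standard_normal((EXPR_DIM, AUDIO_DIM)) * 0.01
b = np.zeros(EXPR_DIM)
audio_feat = rng.standard_normal(AUDIO_DIM)

target_params = {
    "geometry": rng.standard_normal(80),      # identity shape coefficients
    "expression": rng.standard_normal(EXPR_DIM),
    "illumination": rng.standard_normal(27),  # e.g. spherical-harmonics lighting
}

predicted_expr = audio_to_expression(audio_feat, W, b)
reenacted = reenact(target_params, predicted_expr)
```

The reenacted parameter set would then be rasterized into a synthetic face sequence and passed to the neural face renderer to produce the final photorealistic frames.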

Updated: 2020-11-13