Fine-grained talking face generation with video reinterpretation
The Visual Computer (IF 3.0), Pub Date: 2020-09-22, DOI: 10.1007/s00371-020-01982-7
Xin Huang, Mingjie Wang, Minglun Gong

Generating a talking face video from a given audio clip and an arbitrary face image has many applications in areas such as special visual effects and human–computer interaction. This is a challenging task, as it requires disentangling semantic information from both the input audio clip and the face image, then synthesizing novel animated facial image sequences from the combined semantic features. The desired output video should maintain both visual realism and audio–lip motion consistency. To achieve these two objectives, we propose a coarse-to-fine tree-like architecture for synthesizing realistic talking face frames directly from audio clips. This is followed by a video-to-word regeneration module that translates the synthesized talking videos back into the word space, which is enforced to align with the input audio. With multi-level facial landmark attention, the proposed audio-to-video-to-words framework can generate fine-grained talking face videos that are not only synchronous with the input audio but also preserve visual details from the input face images. Multi-purpose discriminators are also adopted for adversarial learning to further improve both image fidelity and semantic consistency. Extensive experiments on the GRID and LRW datasets demonstrate the advantages of our framework over previous methods in terms of video quality and audio–video synchronization.
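To make the audio-to-video-to-words pipeline concrete, the following is a minimal NumPy sketch of the data flow described above: an audio encoder, an identity encoder for the reference face, a coarse-to-fine frame generator, and a video-to-word regeneration head. All function names, dimensions, and the random-projection stand-ins for learned layers are illustrative assumptions, not the paper's actual model.

```python
# Hypothetical sketch of the audio-to-video-to-words data flow, using NumPy
# with random weights in place of trained networks. Names and dimensions are
# illustrative assumptions, not the architecture from the paper.
import numpy as np

rng = np.random.default_rng(0)

def linear(x, out_dim):
    # Stand-in for a learned layer: random projection followed by ReLU.
    w = rng.standard_normal((x.shape[-1], out_dim)) * 0.1
    return np.maximum(x @ w, 0.0)

def encode_audio(audio_clip):
    # Audio clip (T frames x F features) -> per-frame semantic codes.
    return linear(audio_clip, 64)

def encode_face(face_image):
    # Identity features extracted from the flattened reference face image.
    return linear(face_image.reshape(1, -1), 64)

def coarse_to_fine_generator(audio_codes, identity):
    # Coarse-to-fine synthesis: produce a low-resolution frame per audio
    # code, then upsample; a tree-like generator would refine each scale.
    fused = audio_codes + identity           # broadcast identity over time
    coarse = linear(fused, 16 * 16)          # 16x16 coarse frames
    fine = coarse.repeat(2, axis=-1)         # crude stand-in for upsampling
    return fine.reshape(-1, 16, 32)

def video_to_words(frames, vocab_size=10):
    # Regeneration module: map the synthesized video back to word logits,
    # which training would force to align with the input audio's words.
    pooled = frames.reshape(frames.shape[0], -1).mean(axis=0, keepdims=True)
    return linear(pooled, vocab_size)

audio = rng.standard_normal((25, 40))        # 25 frames of 40-dim features
face = rng.standard_normal((32, 32))         # grayscale reference image
frames = coarse_to_fine_generator(encode_audio(audio), encode_face(face))
logits = video_to_words(frames)
print(frames.shape, logits.shape)            # (25, 16, 32) (1, 10)
```

In the actual framework, the random projections would be trained generator and classifier networks, with the word-level loss on `logits` and adversarial losses on `frames` providing the semantic-consistency and fidelity signals.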
