Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
arXiv - CS - Multimedia. Pub Date: 2021-04-22. arXiv ID: 2104.11116
Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, Ziwei Liu

While accurate lip synchronization has been achieved for arbitrary-subject audio-driven talking face generation, the problem of how to efficiently drive the head pose remains open. Previous methods rely on pre-estimated structural information such as landmarks and 3D parameters, aiming to generate personalized rhythmic movements. However, such estimates become inaccurate under extreme conditions, which degrades the generated results. In this paper, we propose a clean yet effective framework to generate pose-controllable talking faces. We operate on raw face images, using only a single photo as an identity reference. The key is to modularize audio-visual representations by devising an implicit low-dimensional pose code. In essence, both speech content and head pose information lie in a joint non-identity embedding space. While speech content information can be defined by learning the intrinsic synchronization between audio-visual modalities, we find that a complementary pose code can be learned within a modulated convolution-based reconstruction framework. Extensive experiments show that our method generates accurately lip-synced talking faces whose poses can be controlled by other videos. Moreover, our model exhibits multiple advanced capabilities, including robustness to extreme views and talking face frontalization. Code, models, and demo videos are available at https://hangz-nju-cuhk.github.io/projects/PC-AVS.
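The abstract describes a modularization in which a reference photo yields an identity code, a target frame yields a joint non-identity embedding that is split into a speech-content code and an implicit low-dimensional pose code, and the content code is tied to audio through a synchronization objective. Below is a minimal PyTorch sketch of that idea; the module architectures, dimensions (including the small pose dimension), and the InfoNCE-style sync loss are illustrative assumptions, not the authors' released PC-AVS implementation (see the project page for the official code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TalkingFaceModules(nn.Module):
    """Sketch: modularize a talking face into identity, speech-content,
    and implicit low-dimensional pose codes (dimensions are assumptions)."""
    def __init__(self, id_dim=256, content_dim=256, pose_dim=12):
        super().__init__()
        # Identity encoder: a single reference photo -> identity code.
        self.identity_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, id_dim),
        )
        # Non-identity encoder: a target frame -> joint non-identity
        # embedding, split into content and an implicit pose code.
        self.nonid_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_content = nn.Linear(64, content_dim)
        self.to_pose = nn.Linear(64, pose_dim)
        # Audio encoder: mel-spectrogram window -> speech-content code,
        # tied to the visual content space by the sync loss below.
        self.audio_enc = nn.Sequential(
            nn.Conv1d(80, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, content_dim),
        )

    def forward(self, ref_img, tgt_img, mel):
        id_code = self.identity_enc(ref_img)
        nonid = self.nonid_enc(tgt_img)
        v_content = F.normalize(self.to_content(nonid), dim=-1)
        pose_code = self.to_pose(nonid)
        a_content = F.normalize(self.audio_enc(mel), dim=-1)
        return id_code, v_content, pose_code, a_content

def sync_contrastive_loss(a_content, v_content, tau=0.07):
    """InfoNCE-style synchronization loss: matching audio/visual content
    pairs attract, in-batch mismatches repel. One way to realize the
    'intrinsic synchronization' objective named in the abstract."""
    logits = a_content @ v_content.t() / tau
    labels = torch.arange(len(logits), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors (batch of 4, 64x64 frames, 80-bin mels).
model = TalkingFaceModules()
ref = torch.randn(4, 3, 64, 64)
tgt = torch.randn(4, 3, 64, 64)
mel = torch.randn(4, 80, 16)
id_c, v_c, p_c, a_c = model(ref, tgt, mel)
loss = sync_contrastive_loss(a_c, v_c)
```

In the full method, the identity, content, and pose codes would condition a modulated convolution generator that reconstructs the target frame; controlling the pose then amounts to swapping in the pose code extracted from a different video.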

Updated: 2021-04-23