Towards Photo-Realistic Facial Expression Manipulation, International Journal of Computer Vision



Towards Photo-Realistic Facial Expression Manipulation
International Journal of Computer Vision (IF 11.6), Pub Date: 2020-08-28, DOI: 10.1007/s11263-020-01361-8
Zhenglin Geng, Chen Cao, Sergey Tulyakov

We present a method for photo-realistic face manipulation. Given a single RGB face image with an arbitrary expression, our method can synthesize another arbitrary expression of the same person. To achieve this, we first fit a 3D face model and disentangle the face into its texture and shape. We then train separate networks in each of these spaces. In texture space, we use a conditional generative network to change the appearance, and carefully design the input format and loss functions to achieve the best results. In shape space, we use a fully connected network to predict an accurate face shape. When available, the shape branch uses depth data for supervision. Both networks are conditioned on expression coefficients rather than discrete labels, allowing us to generate an unlimited number of expressions. Furthermore, we adopt spatially adaptive denormalization on our texture space representation to improve the quality of the synthesized results. We show the superiority of this disentangling approach through both quantitative and qualitative studies. The proposed method does not require paired data, and is trained using an in-the-wild dataset of videos consisting of talking people. To achieve this, we present a simple yet efficient method to select appropriate key frames from these videos. In a user study, our method is preferred in 83.2% of cases when compared to state-of-the-art alternative approaches.
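The abstract notes that conditioning on expression coefficients, rather than discrete labels, allows an unlimited number of expressions. The toy sketch below illustrates why: coefficient vectors are continuous, so any blend of two expressions is itself a valid expression. It assumes a linear expression (blendshape) model, which the abstract does not specify. The function names, basis vectors, and numbers are hypothetical and are not taken from the paper.

```python
# Hypothetical illustration only: a linear expression (blendshape) model.
# The abstract does not say which 3D face model is fitted; all names and
# values below are invented for the example.

def apply_expression(neutral, basis, coeffs):
    """Return neutral + sum_k coeffs[k] * basis[k], element-wise."""
    if len(basis) != len(coeffs):
        raise ValueError("one coefficient per basis vector is required")
    shape = list(neutral)
    for weight, offsets in zip(coeffs, basis):
        for i, offset in enumerate(offsets):
            shape[i] += weight * offset
    return shape


def interpolate(coeffs_a, coeffs_b, t):
    """Linearly blend two coefficient vectors, with t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(coeffs_a, coeffs_b)]


# Toy face: two scalar "vertex" coordinates and two expression basis vectors.
neutral = [0.0, 1.0]
basis = [
    [1.0, 0.0],   # offsets for a first expression direction (made up)
    [0.0, -0.5],  # offsets for a second expression direction (made up)
]
coeffs_a = [1.0, 0.0]  # expression A: only the first direction is active
coeffs_b = [0.0, 1.0]  # expression B: only the second direction is active

# Any t in [0, 1] gives a new coefficient vector, hence a new expression.
halfway = interpolate(coeffs_a, coeffs_b, 0.5)
halfway_shape = apply_expression(neutral, basis, halfway)
```

A discrete label such as "happy" or "sad" can select only one of a fixed set of targets. A coefficient vector can take any value in between, which is the property the abstract relies on.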


Updated: 2020-08-28
