MakeItTalk
ACM Transactions on Graphics (IF 6.2), Pub Date: 2020-11-27, DOI: 10.1145/3414685.3417774
Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, Dingzeyu Li

We present a method that generates expressive talking-head videos from a single facial image with audio as the only input. In contrast to previous attempts to learn direct mappings from audio to raw pixels for creating talking faces, our method first disentangles the content and speaker information in the input audio signal. The audio content robustly controls the motion of lips and nearby facial regions, while the speaker information determines the specifics of facial expressions and the rest of the talking-head dynamics. Another key component of our method is the prediction of facial landmarks reflecting the speaker-aware dynamics. Based on this intermediate representation, our method works with many portrait images in a single unified framework, including artistic paintings, sketches, 2D cartoon characters, Japanese mangas, and stylized caricatures. In addition, our method generalizes well for faces and characters that were not observed during training. We present extensive quantitative and qualitative evaluation of our method, in addition to user studies, demonstrating generated talking-heads of significantly higher quality compared to prior state-of-the-art methods.
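The abstract describes a staged pipeline: extract a per-frame content embedding and a per-speaker identity embedding from the audio, predict speaker-aware facial landmark motion from both, and only then render pixels. The sketch below illustrates that landmark-prediction stage in PyTorch. It is a minimal illustration written for this summary, not the authors' released code; every module name, layer choice, and dimension (SpeakerAwareLandmarkPredictor, content_dim, speaker_dim, the LSTM width) is an assumed placeholder.

    # Minimal sketch (assumptions, not the paper's exact architecture):
    # disentangled audio content + speaker embedding -> per-frame 2D landmarks.
    import torch
    import torch.nn as nn

    class SpeakerAwareLandmarkPredictor(nn.Module):
        def __init__(self, content_dim=80, speaker_dim=256, n_landmarks=68):
            super().__init__()
            # Content branch: per-frame audio features drive lip-sync motion.
            self.content_rnn = nn.LSTM(content_dim, 256, batch_first=True)
            # Speaker embedding modulates expression and head dynamics.
            self.fuse = nn.Linear(256 + speaker_dim, 256)
            # Predict an (x, y) displacement for every landmark, per frame.
            self.head = nn.Linear(256, n_landmarks * 2)

        def forward(self, content_feats, speaker_emb, static_landmarks):
            # content_feats:    (B, T, content_dim)  per-frame audio content
            # speaker_emb:      (B, speaker_dim)     identity embedding of the voice
            # static_landmarks: (B, n_landmarks, 2)  landmarks of the input portrait
            h, _ = self.content_rnn(content_feats)               # (B, T, 256)
            spk = speaker_emb.unsqueeze(1).expand(-1, h.size(1), -1)
            h = torch.relu(self.fuse(torch.cat([h, spk], -1)))   # (B, T, 256)
            disp = self.head(h).view(h.size(0), h.size(1), -1, 2)
            # Animated landmarks = static portrait face + predicted motion.
            return static_landmarks.unsqueeze(1) + disp          # (B, T, L, 2)

    # Smoke test with random inputs standing in for real audio features.
    model = SpeakerAwareLandmarkPredictor()
    lm = model(torch.randn(1, 100, 80), torch.randn(1, 256), torch.randn(1, 68, 2))
    print(lm.shape)  # torch.Size([1, 100, 68, 2])

Driving landmarks rather than raw pixels is what the abstract credits for the method's breadth: the renderer only ever sees 2D landmark motion, which is domain-agnostic, so the same trained model can animate photographs, paintings, sketches, and cartoon characters alike.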

Updated: 2020-11-27