Robust Pose Transfer with Dynamic Details using Neural Video Rendering
arXiv - CS - Graphics. Pub Date: 2021-06-27, DOI: arxiv-2106.14132
Yang-tian Sun, Hao-zhi Huang, Xuan Wang, Yu-kun Lai, Wei Liu, Lin Gao

Pose transfer for human videos aims to generate a high-fidelity video of a target person imitating the actions of a source person. Prior studies have made great progress either through image translation with deep latent features or through neural rendering with explicit 3D features. However, both approaches rely on large amounts of training data to produce realistic results, and their performance degrades on more accessible internet videos because of insufficient training frames. In this paper, we demonstrate that dynamic details can be preserved even when training on short monocular videos. Specifically, we propose a neural video rendering framework coupled with an image-translation-based dynamic details generation network (D2G-Net), which exploits both the stability of explicit 3D features and the capacity of learned components. In particular, we present a novel texture representation that encodes both static and pose-varying appearance characteristics; it is mapped to the image space and rendered into a detail-rich frame in the neural rendering stage. Moreover, we introduce a concise temporal loss during training to suppress the detail flickering that becomes more visible once our method produces high-quality dynamic details. Through extensive comparisons, we demonstrate that our neural human video renderer achieves both clearer dynamic details and more robust performance, even on accessible short videos of only 2k-4k frames.
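The abstract does not specify the exact form of the temporal loss. A common way to penalize frame-to-frame flicker in video generation is to warp the previous generated frame into the current frame's coordinates using optical flow and compare the two; the PyTorch sketch below illustrates that general idea under these assumptions (the function name and flow convention are hypothetical, not from the paper).

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(frame_t, frame_prev, flow):
    """Illustrative flicker-suppression loss (a sketch, not the paper's exact loss).

    frame_t, frame_prev: generated frames at t and t-1, shape (B, C, H, W).
    flow: assumed optical flow from frame t-1 to frame t, shape (B, 2, H, W),
          channel 0 = x displacement, channel 1 = y displacement (in pixels).
    """
    B, _, H, W = frame_t.shape
    # Pixel-coordinate grid, then displace it by the flow.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=flow.device),
        torch.arange(W, device=flow.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).float()   # (2, H, W), (x, y) order
    grid = grid.unsqueeze(0) + flow               # (B, 2, H, W)
    # Normalize to [-1, 1] as required by grid_sample.
    grid[:, 0] = 2.0 * grid[:, 0] / (W - 1) - 1.0
    grid[:, 1] = 2.0 * grid[:, 1] / (H - 1) - 1.0
    grid = grid.permute(0, 2, 3, 1)               # (B, H, W, 2)
    # Detach the previous frame so the loss only pulls the current frame
    # toward it (an assumption; the paper may handle gradients differently).
    warped_prev = F.grid_sample(frame_prev.detach(), grid, align_corners=True)
    # L1 difference between the current frame and the warped previous frame.
    return F.l1_loss(frame_t, warped_prev)

# Example usage with dummy 256x256 RGB frames and zero flow.
f_t = torch.rand(1, 3, 256, 256)
f_prev = torch.rand(1, 3, 256, 256)
flow = torch.zeros(1, 2, 256, 256)
loss = temporal_consistency_loss(f_t, f_prev, flow)
```

In practice such a loss would be masked by flow confidence or occlusion estimates; this sketch omits that for brevity.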

Updated: 2021-06-29