High-Quality Video Generation from Static Structural Annotations
International Journal of Computer Vision (IF 11.6). Pub Date: 2020-05-28. DOI: 10.1007/s11263-020-01334-x
Lu Sheng, Junting Pan, Jiaming Guo, Jing Shao, Chen Change Loy

This paper proposes a novel unsupervised video generation approach conditioned on a single structural annotation map, which, in contrast to prior conditional video generation approaches, strikes a good balance between motion flexibility and visual quality in the generation process. Unlike end-to-end approaches that model scene appearance and dynamics in a single shot, we decompose this difficult task into two easier sub-tasks in a divide-and-conquer fashion, achieving remarkable results overall. The first sub-task is an image-to-image (I2I) translation task that synthesizes a high-quality starting frame from the input structural annotation map. The second, an image-to-video (I2V) generation task, takes the synthesized starting frame and the associated structural annotation map and animates the scene dynamics to generate a photorealistic and temporally coherent video. We employ a cycle-consistent, flow-based conditional variational autoencoder to capture long-term motion distributions, in which the learned bidirectional flows ensure the physical plausibility of the predicted motions and provide explicit occlusion handling in a principled manner. Integrating structural annotations into the flow prediction also improves structural awareness in the I2V generation process. Quantitative and qualitative evaluations on autonomous driving and human action datasets demonstrate the effectiveness of the proposed approach over state-of-the-art methods. The code has been released: https://github.com/junting/seg2vid .
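To make the two-stage decomposition concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: an I2I generator synthesizes the starting frame, a flow-based conditional VAE samples bidirectional flows from a motion latent, and a forward-backward consistency check yields an explicit occlusion mask. All module names, the `z_dim` attribute, the tensor shapes, and the warping details are illustrative assumptions, not the released seg2vid implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(image, flow):
    """Backward-warp `image` (B, C, H, W) with a dense flow field
    `flow` (B, 2, H, W), given in pixel offsets, via bilinear sampling."""
    B, _, H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=image.device),
                            torch.arange(W, device=image.device),
                            indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()          # (2, H, W)
    coords = base.unsqueeze(0) + flow                    # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                 # (B, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)

def occlusion_mask(fwd_flow, bwd_flow, thresh=1.0):
    """Forward-backward consistency check: a pixel is marked occluded when
    the forward flow and the back-warped backward flow do not cancel out."""
    bwd_warped = warp(bwd_flow, fwd_flow)
    err = (fwd_flow + bwd_warped).norm(dim=1, keepdim=True)  # (B, 1, H, W)
    return (err > thresh).float()

class Seg2VidSketch(nn.Module):
    """Two-stage generation: an I2I network synthesizes the starting frame
    from the annotation map; a flow-based conditional VAE then predicts
    bidirectional flows that animate it into a short video."""

    def __init__(self, i2i, flow_cvae, n_frames=8):
        super().__init__()
        self.i2i = i2i              # annotation map -> starting frame
        self.flow_cvae = flow_cvae  # (frame, annotation, z) -> flows
        self.n_frames = n_frames

    def forward(self, annotation):
        frame0 = self.i2i(annotation)                       # stage 1: I2I
        # Sample a motion latent; z_dim is an assumed attribute of the cVAE.
        z = torch.randn(annotation.size(0), self.flow_cvae.z_dim,
                        device=annotation.device)
        # Assumed output shapes: (B, n_frames - 1, 2, H, W) each.
        fwd, bwd = self.flow_cvae(frame0, annotation, z)    # stage 2: I2V
        frames = [frame0]
        for t in range(self.n_frames - 1):
            occ = occlusion_mask(fwd[:, t], bwd[:, t])
            # Warp the starting frame along the flow of step t and zero out
            # pixels the consistency check flags as occluded.
            frames.append(warp(frame0, bwd[:, t]) * (1.0 - occ))
        return torch.stack(frames, dim=1)                   # (B, T, C, H, W)
```

The forward-backward check is one standard way to realize the "explicit occlusion handling" the abstract mentions: where the two flows fail to cancel, the pixel has no reliable correspondence in the starting frame and is masked out rather than hallucinated by warping.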
