Towards Open-World Text-Guided Face Image Generation and Manipulation
arXiv - CS - Multimedia. Pub Date: 2021-04-18, DOI: arxiv-2104.08910
Weihao Xia, Yujiu Yang, Jing-Hao Xue, Baoyuan Wu

Existing text-guided image synthesis methods can only produce limited-quality results at resolutions of at most $256^2$, and their textual instructions are constrained to a small corpus. In this work, we propose a unified framework for both face image generation and manipulation that produces diverse and high-quality images at an unprecedented resolution of $1024^2$ from multimodal inputs. More importantly, our method supports open-world scenarios for both images and text, without any re-training, fine-tuning, or post-processing. Specifically, we propose a new paradigm of text-guided image generation and manipulation built on the superior characteristics of a pretrained GAN model. Our proposed paradigm includes two novel strategies. The first is to train a text encoder to obtain latent codes aligned with the hierarchical semantics of the pretrained GAN model. The second is to directly optimize latent codes in the latent space of the pretrained GAN model under the guidance of a pretrained language model. The latent codes can be randomly sampled from a prior distribution or inverted from a given image, which inherently supports both image generation and manipulation from multimodal inputs, such as sketches or semantic labels, with textual guidance. To facilitate text-guided multimodal synthesis, we propose Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images paired with corresponding semantic segmentation maps, sketches, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at https://github.com/weihaox/TediGAN.
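As a rough illustration of the second strategy (directly optimizing latent codes under the guidance of a pretrained language model), the sketch below runs a simple latent-code optimization loop in PyTorch. All components are toy stand-ins, not the authors' implementation: `Generator` plays the role of the pretrained GAN, `TextImageScorer` plays the role of a pretrained vision-language scorer, and the names, dimensions, and regularization weight are illustrative assumptions; see https://github.com/weihaox/TediGAN for the actual code.

```python
# Minimal sketch of text-guided latent-code optimization in a pretrained GAN's
# latent space. Every module here is a toy placeholder for a real pretrained network.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 512
IMAGE_DIM = 3 * 64 * 64  # toy "image" size, far below the paper's 1024^2 resolution


class Generator(nn.Module):
    """Placeholder for a pretrained GAN generator: latent code -> image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 1024), nn.ReLU(),
                                 nn.Linear(1024, IMAGE_DIM), nn.Tanh())

    def forward(self, w):
        return self.net(w)


class TextImageScorer(nn.Module):
    """Placeholder for a pretrained vision-language model that scores image/text similarity."""
    def __init__(self):
        super().__init__()
        self.image_proj = nn.Linear(IMAGE_DIM, 256)
        self.text_proj = nn.Linear(77, 256)  # 77 = toy length of an encoded caption

    def forward(self, image, text_emb):
        img_feat = F.normalize(self.image_proj(image), dim=-1)
        txt_feat = F.normalize(self.text_proj(text_emb), dim=-1)
        return (img_feat * txt_feat).sum(dim=-1)  # cosine similarity


def optimize_latent(generator, scorer, text_emb, w_init, steps=200, lr=0.05):
    """Optimize only the latent code so the generated image matches the text.

    `w_init` can be sampled from the prior (generation) or obtained by inverting
    a given image (manipulation), as described in the abstract.
    """
    # Both pretrained networks stay frozen; only the latent code is updated.
    for p in list(generator.parameters()) + list(scorer.parameters()):
        p.requires_grad_(False)

    w = w_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        image = generator(w)
        sim = scorer(image, text_emb).mean()
        # Stay close to the initial code; the 0.01 weight is an arbitrary choice here.
        reg = 0.01 * (w - w_init).pow(2).sum()
        loss = -sim + reg
        loss.backward()
        optimizer.step()
    return w.detach()


if __name__ == "__main__":
    torch.manual_seed(0)
    generator, scorer = Generator(), TextImageScorer()
    text_emb = torch.randn(1, 77)        # stand-in for an encoded caption
    w_init = torch.randn(1, LATENT_DIM)  # sampled from the prior -> pure generation
    w_edit = optimize_latent(generator, scorer, text_emb, w_init)
    print(generator(w_edit).shape)
```

The design point the sketch captures is that the generator and the language model remain frozen; only the latent code moves, which is why the approach needs no re-training or fine-tuning and works for both generation (random initial code) and manipulation (inverted initial code).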

Updated: 2021-04-20