Semantically consistent text to fashion image synthesis with an enhanced attentional generative adversarial network
Pattern Recognition Letters (IF 3.9) Pub Date: 2020-03-27, DOI: 10.1016/j.patrec.2020.02.030
Kenan E. Ak, Joo Hwee Lim, Jo Yew Tham, Ashraf A. Kassim

Recent advancements in Generative Adversarial Networks (GANs) have led to significant improvements in various image generation tasks, including image synthesis based on text descriptions. In this paper, we present an enhanced Attentional Generative Adversarial Network (e-AttnGAN) with improved training stability for text-to-image synthesis. e-AttnGAN’s integrated attention module utilizes both sentence and word context features and performs feature-wise linear modulation (FiLM) to fuse visual and natural language representations. In addition to the multimodal similarity learning for text and image features of AttnGAN [1], similarity and feature matching losses between real and generated images are included, while classification losses are employed for “significant attributes”. To improve training stability and mitigate mode collapse, spectral normalization and the two time-scale update rule (TTUR) are applied to the discriminator, together with instance noise. Our experiments show that e-AttnGAN outperforms state-of-the-art methods on the FashionGen and DeepFashion-Synthesis datasets in terms of inception score, R-precision and classification accuracy. A detailed ablation study has been conducted to observe the effect of each component.
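The abstract names several concrete mechanisms: FiLM-based fusion of text and visual features, and a discriminator stabilized with spectral normalization, the two time-scale update rule (TTUR) and instance noise. The sketch below is not the authors' implementation; it is a minimal PyTorch illustration of how such pieces are commonly wired. The class names (FiLMFusion, SNDiscriminator), layer sizes, learning rates and noise level are illustrative assumptions and are not taken from the paper.

# Minimal sketch (assumed PyTorch), not the authors' code: FiLM fusion of
# text and visual features, plus a spectrally normalized discriminator
# trained with instance noise and TTUR-style optimizers.
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm


class FiLMFusion(nn.Module):
    """Feature-wise linear modulation: text features predict per-channel
    scale (gamma) and shift (beta) applied to visual feature maps."""

    def __init__(self, text_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, num_channels)
        self.to_beta = nn.Linear(text_dim, num_channels)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W), text: (B, text_dim)
        gamma = self.to_gamma(text).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(text).unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * visual + beta


class SNDiscriminator(nn.Module):
    """Toy discriminator with spectral normalization on every conv layer."""

    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Conv2d(in_channels, 64, 4, 2, 1)),
            nn.LeakyReLU(0.2, inplace=True),
            spectral_norm(nn.Conv2d(64, 128, 4, 2, 1)),
            nn.LeakyReLU(0.2, inplace=True),
            spectral_norm(nn.Conv2d(128, 1, 4, 1, 0)),
        )

    def forward(self, x: torch.Tensor, noise_std: float = 0.05) -> torch.Tensor:
        # Instance noise: perturb inputs to smooth the discriminator's task.
        if self.training and noise_std > 0:
            x = x + noise_std * torch.randn_like(x)
        return self.net(x).view(x.size(0), -1)


if __name__ == "__main__":
    fusion = FiLMFusion(text_dim=256, num_channels=64)
    fused = fusion(torch.randn(2, 64, 16, 16), torch.randn(2, 256))
    disc = SNDiscriminator()
    # TTUR: the discriminator gets a larger learning rate than the generator;
    # the exact values here are illustrative, not taken from the paper.
    d_opt = torch.optim.Adam(disc.parameters(), lr=4e-4, betas=(0.0, 0.9))
    g_opt = torch.optim.Adam(fusion.parameters(), lr=1e-4, betas=(0.0, 0.9))
    print(fused.shape, disc(torch.randn(2, 3, 64, 64)).shape)

In this reading, the text conditioning enters the generator via FiLM-style affine modulation of intermediate feature maps, while the unequal learning rates stand in for the two time-scale update rule; the paper's full loss terms (similarity, feature matching and attribute classification) are omitted here for brevity.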



Updated: 2020-03-27