Cross-modal Feature Alignment based Hybrid Attentional Generative Adversarial Networks for text-to-image synthesis
Digital Signal Processing (IF 2.9) Pub Date: 2020-09-30, DOI: 10.1016/j.dsp.2020.102866
Qingrong Cheng , Xiaodong Gu

With the development of generative models, image synthesis has become a research hotspot. This paper presents novel Cross-modal Feature Alignment based Hybrid Attentional Generative Adversarial Networks (CFA-HAGAN) for text-to-image synthesis. The approach consists of two steps: text-image encoding and text-to-image synthesis. Text-image encoding learns a Cross-modal Feature Alignment Model (CFAM), which adopts a fine-grained attentional network to learn aligned features for the original modalities. The feature alignment space is viewed as the transitional space in the whole process. Then, the Hybrid Attentional Generative Adversarial Networks (HAGAN) learn the inverse mapping from the encoded text feature to the original image. Specifically, the hybrid attention block consists of a text-image cross-modal attention mechanism and an image self-attention mechanism. Cross-modal attention makes the synthesized image fine-grained by adding word-level information as additional supervision. Self-attention addresses the long-range dependency problem among image sub-region features when images are synthesized from the hidden feature. Although GANs perform well across a wide range of tasks, they are notoriously difficult to train. By adopting spectral normalization, the discriminators satisfy the 1-Lipschitz constraint, which makes their training more stable than that of the original GAN. In quantitative and qualitative comparisons with many state-of-the-art methods, the experimental results show that the proposed method achieves better performance on evaluation metrics and in visual quality. In addition, the experimental section presents attention visualization, an ablation study, and a generalization-ability analysis to demonstrate the effectiveness of the proposed method.
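To make the hybrid attention block concrete, the PyTorch sketch below pairs AttnGAN-style word-level cross-modal attention (each image sub-region attends to word features) with SAGAN-style self-attention over sub-regions. The class name, tensor shapes, and the residual fusion rule are illustrative assumptions, not the paper's actual implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HybridAttention(nn.Module):
        """Sketch of a hybrid attention block: word-level cross-modal
        attention plus self-attention over image sub-regions.
        Names and shapes are assumptions, not the paper's code."""
        def __init__(self, img_dim, word_dim):
            super().__init__()
            # Project word embeddings into the image feature space.
            self.word_proj = nn.Conv1d(word_dim, img_dim, kernel_size=1)
            # 1x1 convs for self-attention query/key/value, as in SAGAN.
            self.query = nn.Conv2d(img_dim, img_dim // 8, 1)
            self.key = nn.Conv2d(img_dim, img_dim // 8, 1)
            self.value = nn.Conv2d(img_dim, img_dim, 1)
            self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

        def forward(self, h, words):
            # h: (B, C, H, W) hidden image features; words: (B, D, T) word features.
            B, C, H, W = h.shape
            regions = h.view(B, C, H * W)                    # (B, C, N)

            # Cross-modal attention: each sub-region attends to the words.
            w = self.word_proj(words)                        # (B, C, T)
            attn_tw = F.softmax(torch.bmm(regions.transpose(1, 2), w), dim=-1)
            word_ctx = torch.bmm(w, attn_tw.transpose(1, 2)).view(B, C, H, W)

            # Self-attention: long-range dependencies among sub-regions.
            q = self.query(h).view(B, -1, H * W)             # (B, C/8, N)
            k = self.key(h).view(B, -1, H * W)               # (B, C/8, N)
            v = self.value(h).view(B, -1, H * W)             # (B, C, N)
            attn_ss = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)
            self_ctx = torch.bmm(v, attn_ss.transpose(1, 2)).view(B, C, H, W)

            # Fuse both contexts with the hidden features (residual form).
            return h + self.gamma * self_ctx + word_ctx

The word-attended context injects word-level supervision into each sub-region, while the self-attention path lets distant sub-regions influence one another, matching the two roles the abstract assigns to the hybrid block.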
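The spectral normalization step can likewise be sketched with PyTorch's built-in torch.nn.utils.spectral_norm, which rescales each weight matrix by an estimate of its largest singular value so every wrapped layer is 1-Lipschitz; composed with 1-Lipschitz activations such as LeakyReLU, the whole discriminator then satisfies the 1-Lipschitz constraint. The layer layout and channel sizes below are hypothetical, not taken from the paper.

    import torch.nn as nn
    from torch.nn.utils import spectral_norm

    # Minimal discriminator sketch; each weight layer is spectrally normalized.
    discriminator = nn.Sequential(
        spectral_norm(nn.Conv2d(3, 64, 4, stride=2, padding=1)),
        nn.LeakyReLU(0.2),  # LeakyReLU with slope 0.2 is itself 1-Lipschitz
        spectral_norm(nn.Conv2d(64, 128, 4, stride=2, padding=1)),
        nn.LeakyReLU(0.2),
        spectral_norm(nn.Conv2d(128, 1, 4)),  # real/fake score map
    )

Bounding the Lipschitz constant in this way keeps discriminator gradients well-behaved, which is the stabilization effect the abstract attributes to spectral normalization.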




Updated: 2020-10-11