Abstract
For a given text, previous text-to-image synthesis methods commonly employ a multistage generation model to produce high-resolution images in a coarse-to-fine manner. However, these methods ignore the interaction among stages and do not enforce consistent cross-sample relations among the images generated at different stages. These deficiencies result in inefficient generation and discrimination. In this study, we propose an interstage cross-sample similarity distillation model based on a generative adversarial network (GAN) for learning efficient text-to-image synthesis. To strengthen the interaction among stages, we perform interstage knowledge distillation from the refined stage to the coarse stages with novel interstage cross-sample similarity distillation blocks. To enhance the constraint on the cross-sample relations of the images generated at different stages, we conduct cross-sample similarity distillation among the stages. Extensive experiments on the Oxford-102 and Caltech-UCSD Birds-200-2011 (CUB) datasets show that our model generates visually pleasing images and achieves quantitatively comparable performance with state-of-the-art methods.
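The core idea of cross-sample similarity distillation can be illustrated with a minimal sketch: within a batch, each stage's image features induce a batch-by-batch similarity matrix, and the coarse (student) stage is trained to match the refined (teacher) stage's matrix. The function names, cosine similarity, and mean-squared-error matching below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cross_sample_similarity(features):
    """Pairwise cosine-similarity matrix among the samples in a batch.

    features: (batch, dim) array of pooled image features from one stage.
    Returns a (batch, batch) matrix encoding cross-sample relations.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    return f @ f.T

def similarity_distillation_loss(coarse_feats, refined_feats):
    """Match the coarse stage's cross-sample relations to the refined stage's.

    The refined-stage matrix plays the role of the (fixed) teacher signal;
    the loss is the mean squared difference between the two matrices.
    """
    s_coarse = cross_sample_similarity(coarse_feats)
    s_refined = cross_sample_similarity(refined_feats)
    return float(np.mean((s_coarse - s_refined) ** 2))

# Toy usage with random stand-ins for stage features.
rng = np.random.default_rng(0)
coarse = rng.normal(size=(4, 16))
refined = rng.normal(size=(4, 16))
loss = similarity_distillation_loss(coarse, refined)
```

Because the loss depends only on relations within a batch rather than on per-pixel targets, it transfers structural knowledge between stages without requiring the stages to share a feature space of identical resolution.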
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61876171, 61976203) and the Fundamental Research Funds for the Central Universities.
Cite this article
Mao, F., Ma, B., Chang, H. et al. Learning efficient text-to-image synthesis via interstage cross-sample similarity distillation. Sci. China Inf. Sci. 64, 120102 (2021). https://doi.org/10.1007/s11432-020-2900-x