
Learning efficient text-to-image synthesis via interstage cross-sample similarity distillation

  • Research Paper
  • Special Focus on Deep Learning for Computer Vision
  • Published in: Science China Information Sciences

Abstract

For a given text, previous text-to-image synthesis methods commonly utilize a multistage generation model to produce high-resolution images in a coarse-to-fine manner. However, these methods ignore the interaction among stages and do not constrain the cross-sample relations of images generated at different stages to be consistent. These deficiencies result in inefficient generation and discrimination. In this study, we propose an interstage cross-sample similarity distillation model based on a generative adversarial network (GAN) for learning efficient text-to-image synthesis. To strengthen the interaction among stages, we perform interstage knowledge distillation from the refined stage to the coarse stages with novel interstage cross-sample similarity distillation blocks. To enhance the constraint on the cross-sample relations of images generated at different stages, we conduct cross-sample similarity distillation among the stages. Extensive experiments on the Oxford-102 and Caltech-UCSD Birds-200-2011 (CUB) datasets show that our model generates visually pleasing images and achieves performance quantitatively comparable to that of state-of-the-art methods.
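The paper's actual blocks and losses are not reproduced on this page, but the core idea of cross-sample similarity distillation can be illustrated with a minimal sketch. Under the assumption that each stage yields one pooled feature vector per image in a batch, the hypothetical NumPy code below builds a pairwise cosine-similarity matrix per stage and penalizes the coarse stage for deviating from the refined stage's matrix; all function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np

def cross_sample_similarity(features: np.ndarray) -> np.ndarray:
    """Pairwise cosine-similarity matrix over a batch of pooled image features.

    features: (batch, dim) array, e.g. globally pooled generator activations.
    Returns a (batch, batch) matrix encoding between-sample relations.
    """
    normed = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    return normed @ normed.T

def similarity_distillation_loss(coarse_feats: np.ndarray,
                                 refined_feats: np.ndarray) -> float:
    """MSE between the cross-sample similarity matrices of two stages.

    In a trainable setting, gradients from this loss would push the coarse
    stage to reproduce the between-sample relations of the refined stage,
    which acts as the (fixed) teacher.
    """
    s_coarse = cross_sample_similarity(coarse_feats)
    s_refined = cross_sample_similarity(refined_feats)
    return float(np.mean((s_coarse - s_refined) ** 2))

# Toy usage: a batch of 4 samples with 128-d pooled features per stage.
rng = np.random.default_rng(0)
coarse = rng.normal(size=(4, 128))   # e.g. features from a 64x64 coarse stage
refined = rng.normal(size=(4, 128))  # e.g. features from a 256x256 refined stage
print(similarity_distillation_loss(coarse, refined))
```

Because the loss compares relations between samples rather than individual feature maps, it transfers knowledge across stages even when their spatial resolutions differ, which is what makes it a natural fit for coarse-to-fine generation.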



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61876171, 61976203) and the Fundamental Research Funds for the Central Universities.

Author information


Corresponding author

Correspondence to Bingpeng Ma.

About this article

Cite this article

Mao, F., Ma, B., Chang, H. et al. Learning efficient text-to-image synthesis via interstage cross-sample similarity distillation. Sci. China Inf. Sci. 64, 120102 (2021). https://doi.org/10.1007/s11432-020-2900-x
