
A Comprehensive Pipeline for Complex Text-to-Image Synthesis

Regular Paper
Journal of Computer Science and Technology

Abstract

Synthesizing a complex scene image containing multiple objects and a background from a text description is a challenging problem. It requires solving several difficult tasks across natural language processing and computer vision. We model the problem as a combination of semantic entity recognition, object retrieval and recombination, and object state optimization. To reach a satisfactory result, we propose a comprehensive pipeline that converts the input text into its visual counterpart. The pipeline comprises text processing, retrieval of foreground objects and a background scene, image synthesis using constrained MCMC, and post-processing. First, we divide the objects parsed from the input text into foreground objects and background scenes. Second, we retrieve the required foreground objects from a foreground object dataset segmented from the Microsoft COCO dataset, and retrieve an appropriate background scene image from a background image dataset collected from the Internet. Third, to ensure plausible positions and sizes of the foreground objects during image synthesis, we design a cost function and use the Markov chain Monte Carlo (MCMC) method as the optimizer to solve this constrained layout problem. Finally, to make the image look natural and harmonious, we blend the foreground objects and the background scene image in the post-processing step using Poisson-based and relighting-based methods. Synthesized results and comparisons on the Microsoft COCO dataset show that our method outperforms several state-of-the-art methods based on generative adversarial networks (GANs) in the visual quality of the generated scene images.
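To make the layout-optimization step concrete, the fragment below is a minimal, self-contained sketch of a Metropolis-style MCMC search over object bounding boxes. The cost terms (pairwise overlap and out-of-canvas penalties), the proposal moves, and the fixed temperature are illustrative assumptions only; the paper's actual cost function also encodes semantic constraints on positions and sizes parsed from the text, and its sampler is a constrained (annealed) MCMC variant.

```python
import math
import random

def overlap_area(a, b):
    """Intersection area of two boxes given as (x, y, w, h)."""
    dx = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    dy = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(dx, 0) * max(dy, 0)

def cost(boxes, width, height):
    """Illustrative cost: penalize pairwise overlap and area outside the canvas."""
    c = 0.0
    for i, a in enumerate(boxes):
        inside = overlap_area(a, (0, 0, width, height))
        c += a[2] * a[3] - inside            # out-of-bounds penalty
        for b in boxes[i + 1:]:
            c += overlap_area(a, b)          # overlap penalty
    return c

def propose(boxes, step=20.0):
    """Perturb one random box: jitter its position, rescale it slightly."""
    new = [list(b) for b in boxes]
    k = random.randrange(len(new))
    new[k][0] += random.gauss(0, step)
    new[k][1] += random.gauss(0, step)
    s = math.exp(random.gauss(0, 0.05))      # small multiplicative size change
    new[k][2] *= s
    new[k][3] *= s
    return [tuple(b) for b in new]

def optimize_layout(boxes, width, height, iters=5000, temperature=50.0):
    """Metropolis acceptance: take improvements, sometimes accept worse states."""
    cur, cur_cost = boxes, cost(boxes, width, height)
    for _ in range(iters):
        cand = propose(cur)
        cand_cost = cost(cand, width, height)
        if cand_cost < cur_cost or random.random() < math.exp(
            (cur_cost - cand_cost) / temperature
        ):
            cur, cur_cost = cand, cand_cost
    return cur
```

The accept/reject skeleton is the reusable part: swapping in richer cost terms (relative position, depth ordering, size priors) leaves the optimizer unchanged.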
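For the post-processing step, Poisson-based blending in the sense of Pérez et al. [37] is available off the shelf: OpenCV's seamlessClone implements that method. The sketch below shows the call shape; the file names, the full-object mask, and the paste position are placeholders (in the pipeline the position would come from the layout step), and the relighting-based pass is not shown.

```python
import cv2
import numpy as np

# Placeholder inputs: a retrieved foreground object and background scene.
foreground = cv2.imread("object.png")
background = cv2.imread("scene.jpg")

# Full-object mask for illustration; a segmentation mask would be used instead.
mask = 255 * np.ones(foreground.shape[:2], dtype=np.uint8)

# Paste position (center of the target region); the object must fit inside
# the background at this location or OpenCV raises an error.
center = (background.shape[1] // 2, background.shape[0] // 2)

# Poisson image editing: gradients of the source are preserved while colors
# are adjusted to match the destination along the seam.
blended = cv2.seamlessClone(foreground, background, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("composite.jpg", blended)
```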


References

  1. Lin T Y, Maire M, Belongie S et al. Microsoft COCO: Common objects in context. In Proc. the 13th European Conference on Computer Vision, September 2014, pp.740-755.

  2. Krishna R, Zhu Y, Groth O et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 2017, 123(1): 32-73.

  3. Mansimov E, Parisotto E, Ba J L et al. Generating images from captions with attention. arXiv:1511.02793, 2015. https://arxiv.org/abs/1511.02793, October 2019.

  4. Reed S, Akata Z, Yan X et al. Generative adversarial text to image synthesis. arXiv:1605.05396, 2016. https://arxiv.org/abs/1605.05396, October 2019.

  5. Zhang H, Xu T, Li H et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proc. the 2017 IEEE International Conference on Computer Vision, October 2017, pp.5907-5915.

  6. Lalonde J F, Hoiem D, Efros A A et al. Photo clip art. ACM Transactions on Graphics, 2007, 26(3): Article No. 3.

  7. Chen T, Cheng M M, Tan P et al. Sketch2Photo: Internet image montage. ACM Transactions on Graphics, 2009, 28(5): Article No. 124.

  8. Chen T, Tan P, Ma L Q et al. PoseShop: Human image database construction and personalized content synthesis. IEEE Transactions on Visualization and Computer Graphics, 2013, 19(5): 824-837.

  9. Fang F, Yi M, Feng H et al. Narrative collage of image collections by scene graph recombination. IEEE Transactions on Visualization and Computer Graphics, 2018, 24(9): 2559-2572.

  10. Zitnick C L, Parikh D. Bringing semantics into focus using visual abstraction. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2013, pp.3009-3016.

  11. Zitnick C L, Parikh D, Vanderwende L. Learning the visual interpretation of sentences. In Proc. the IEEE International Conference on Computer Vision, December 2013, pp.1681-1688.

  12. Coyne B, Sproat R. WordsEye: An automatic text-to-scene conversion system. In Proc. the 28th Annual Conference on Computer Graphics and Interactive Techniques, August 2001, pp.487-496.

  13. Chang A, Savva M, Manning C D. Learning spatial knowledge for text to 3D scene generation. In Proc. the 2014 Conference on Empirical Methods in Natural Language Processing, October 2014, pp.2028-2038.

  14. Reed S, van den Oord A, Kalchbrenner N et al. Generating interpretable images with controllable structure. In Proc. the International Conference on Learning Representations, April 2017.

  15. Goodfellow I, Pouget-Abadie J, Mirza M et al. Generative adversarial nets. In Proc. the Annual Conference on Neural Information Processing Systems, December 2014, pp.2672-2680.

  16. Reed S E, Akata Z, Mohan S et al. Learning what and where to draw. In Proc. the Annual Conference on Neural Information Processing Systems, December 2016, pp.217-225.

  17. Zhang H, Xu T, Li H et al. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(8): 1947-1962.

  18. Xu T, Zhang P, Huang Q et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2018, pp.1316-1324.

  19. Yin G, Liu B, Sheng L et al. Semantics disentangling for text-to-image generation. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.2327-2336.

  20. Zhou X, Huang S, Li B et al. Text guided person image synthesis. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.3663-3672.

  21. Tan H, Liu X, Li X et al. Semantics-enhanced adversarial nets for text-to-image synthesis. In Proc. the IEEE International Conference on Computer Vision, October 2019, pp.10500-10509.

  22. Qiao T, Zhang J, Xu D et al. MirrorGAN: Learning text-to-image generation by redescription. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.1505-1514.

  23. Johnson J, Gupta A, Li F F. Image generation from scene graphs. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2018, pp.1219-1228.

  24. Li W, Zhang P, Zhang L et al. Object-driven text-to-image synthesis via adversarial training. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.12174-12182.

  25. Hinz T, Heinrich S, Wermter S. Generating multiple objects at spatially distinct locations. arXiv:1901.00686, 2019. https://arxiv.org/abs/1901.00686, October 2019.

  26. Xu K, Ba J, Kiros R et al. Show, attend and tell: Neural image caption generation with visual attention. In Proc. the 32nd International Conference on Machine Learning, July 2015, pp.2048-2057.

  27. Karpathy A, Li F F. Deep visual-semantic alignments for generating image descriptions. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp.3128-3137.

  28. Johnson J, Karpathy A, Li F F. DenseCap: Fully convolutional localization networks for dense captioning. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp.4565-4574.

  29. Krause J, Johnson J, Krishna R et al. A hierarchical approach for generating descriptive image paragraphs. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.3337-3345.

  30. Yao L, Torabi A, Cho K et al. Describing videos by exploiting temporal structure. In Proc. the IEEE International Conference on Computer Vision, December 2015, pp.4507-4515.

  31. Yu H, Wang J, Huang Z et al. Video paragraph captioning using hierarchical recurrent neural networks. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp.4584-4593.

  32. Li A, Sun J, Ng J Y H et al. Generating holistic 3D scene abstractions for text-based image retrieval. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.1942-1950.

  33. Fellbaum C. WordNet. In Theory and Applications of Ontology: Computer Applications, Poli R, Healy M, Kameas A (eds.), Springer Netherlands, 2010, pp.231-243.

  34. He K, Gkioxari G, Dollár P et al. Mask R-CNN. In Proc. the IEEE International Conference on Computer Vision, October 2017, pp.2980-2988.

  35. Laina I, Rupprecht C, Belagiannis V et al. Deeper depth prediction with fully convolutional residual networks. In Proc. the 4th International Conference on 3D Vision, October 2016, pp.239-248.

  36. Yeh Y T, Yang L, Watson M et al. Synthesizing open worlds with constraints using locally annealed reversible jump MCMC. ACM Transactions on Graphics, 2012, 31(4): Article No. 56.

  37. Pérez P, Gangnet M, Blake A. Poisson image editing. ACM Transactions on Graphics, 2003, 22(3): 313-318.

  38. Liao Z, Karsch K, Forsyth D. An approximate shading model for object relighting. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp.5307-5314.

  39. Elder J H. Shape from contour: Computation and representation. Annual Review of Vision Science, 2018, 4(1): 423-450.

  40. Johnston S F. Lumo: Illumination for cel animation. In Proc. the 2nd International Symposium on Non-Photorealistic Animation and Rendering, June 2002, pp.45-52.

  41. Wu T P, Sun J, Tang C K et al. Interactive normal reconstruction from a single image. ACM Transactions on Graphics, 2008, 27(5): Article No. 119.

  42. Grosse R, Johnson M K, Adelson E H et al. Ground truth dataset and baseline evaluations for intrinsic image algorithms. In Proc. the 12th IEEE International Conference on Computer Vision, September 2009, pp.2335-2342.

  43. Karsch K, Sunkavalli K, Hadap S et al. Automatic scene inference for 3D object compositing. ACM Transactions on Graphics, 2014, 33(3): Article No. 32.

  44. Godard C, Mac Aodha O, Brostow G J. Unsupervised monocular depth estimation with left-right consistency. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.6602-6611.

Author information

Correspondence to Chun-Xia Xiao.

Electronic supplementary material

ESM 1 (PDF 884 kb)

Cite this article

Fang, F., Luo, F., Zhang, HP. et al. A Comprehensive Pipeline for Complex Text-to-Image Synthesis. J. Comput. Sci. Technol. 35, 522–537 (2020). https://doi.org/10.1007/s11390-020-0305-9