
A Comprehensive Pipeline for Complex Text-to-Image Synthesis

Regular Paper
Journal of Computer Science and Technology

Abstract

Synthesizing a complex scene image containing multiple objects and a background from a text description is a challenging problem. It requires solving several difficult tasks across natural language processing and computer vision. We model the problem as a combination of semantic entity recognition, object retrieval and recombination, and object state optimization. To reach a satisfactory result, we propose a comprehensive pipeline that converts the input text into its visual counterpart. The pipeline comprises text processing, retrieval of foreground objects and a background scene, image synthesis using constrained MCMC, and post-processing. First, we divide the objects parsed from the input text into foreground objects and background scenes. Second, we retrieve the required foreground objects from a foreground object dataset segmented from the Microsoft COCO dataset, and retrieve an appropriate background scene image from a background image dataset collected from the Internet. Third, to ensure plausible positions and sizes of the foreground objects during image synthesis, we design a cost function and use the Markov chain Monte Carlo (MCMC) method as the optimizer to solve this constrained layout problem. Finally, to make the image look natural and harmonious, we blend the foreground objects and the background scene image in the post-processing step using Poisson-based and relighting-based methods. Synthesized results and comparisons on the Microsoft COCO dataset show that our method outperforms several state-of-the-art methods based on generative adversarial networks (GANs) in the visual quality of the generated scene images.
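To make the layout-optimization step concrete, the fragment below is a minimal, self-contained sketch of a Metropolis-style MCMC search over object bounding boxes. The cost terms (pairwise overlap and out-of-canvas penalties), the proposal moves, and the fixed temperature are illustrative assumptions only; the paper's actual cost function also encodes semantic constraints on positions and sizes parsed from the text, and its sampler is a constrained (annealed) MCMC variant.

```python
import math
import random

def overlap_area(a, b):
    """Intersection area of two boxes given as (x, y, w, h)."""
    dx = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    dy = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(dx, 0) * max(dy, 0)

def cost(boxes, width, height):
    """Illustrative cost: penalize pairwise overlap and area outside the canvas."""
    c = 0.0
    for i, a in enumerate(boxes):
        inside = overlap_area(a, (0, 0, width, height))
        c += a[2] * a[3] - inside            # out-of-bounds penalty
        for b in boxes[i + 1:]:
            c += overlap_area(a, b)          # overlap penalty
    return c

def propose(boxes, step=20.0):
    """Perturb one random box: jitter its position, rescale it slightly."""
    new = [list(b) for b in boxes]
    k = random.randrange(len(new))
    new[k][0] += random.gauss(0, step)
    new[k][1] += random.gauss(0, step)
    s = math.exp(random.gauss(0, 0.05))      # small multiplicative size change
    new[k][2] *= s
    new[k][3] *= s
    return [tuple(b) for b in new]

def optimize_layout(boxes, width, height, iters=5000, temperature=50.0):
    """Metropolis acceptance: take improvements, sometimes accept worse states."""
    cur, cur_cost = boxes, cost(boxes, width, height)
    for _ in range(iters):
        cand = propose(cur)
        cand_cost = cost(cand, width, height)
        if cand_cost < cur_cost or random.random() < math.exp(
            (cur_cost - cand_cost) / temperature
        ):
            cur, cur_cost = cand, cand_cost
    return cur
```

The accept/reject skeleton is the reusable part: swapping in richer cost terms (relative position, depth ordering, size priors) leaves the optimizer unchanged.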
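For the post-processing step, Poisson-based blending in the sense of Pérez et al. [37] is available off the shelf: OpenCV's seamlessClone implements that method. The sketch below shows the call shape; the file names, the full-object mask, and the paste position are placeholders (in the pipeline the position would come from the layout step), and the relighting-based pass is not shown.

```python
import cv2
import numpy as np

# Placeholder inputs: a retrieved foreground object and background scene.
foreground = cv2.imread("object.png")
background = cv2.imread("scene.jpg")

# Full-object mask for illustration; a segmentation mask would be used instead.
mask = 255 * np.ones(foreground.shape[:2], dtype=np.uint8)

# Paste position (center of the target region); the object must fit inside
# the background at this location or OpenCV raises an error.
center = (background.shape[1] // 2, background.shape[0] // 2)

# Poisson image editing: gradients of the source are preserved while colors
# are adjusted to match the destination along the seam.
blended = cv2.seamlessClone(foreground, background, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("composite.jpg", blended)
```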


References

  1. Lin T Y, Maire M, Belongie S et al. Microsoft COCO: Common objects in context. In Proc. the 13th European Conference on Computer Vision, September 2014, pp.740-755.

  2. Krishna R, Zhu Y, Groth O et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 2017, 123(1): 32-73.

  3. Mansimov E, Parisotto E, Ba J L et al. Generating images from captions with attention. arXiv:1511.02793, 2015. https://arxiv.org/abs/1511.02793, October 2019.

  4. Reed S, Akata Z, Yan X et al. Generative adversarial text to image synthesis. arXiv:1605.05396, 2016. https://arxiv.org/abs/1605.05396, October 2019.

  5. Zhang H, Xu T, Li H et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proc. the 2017 IEEE International Conference on Computer Vision, October 2017, pp.5907-5915.

  6. Lalonde J F, Hoiem D, Efros A A et al. Photo clip art. ACM Transactions on Graphics, 2007, 26(3): Article No. 3.

  7. Chen T, Cheng M M, Tan P et al. Sketch2Photo: Internet image montage. ACM Transactions on Graphics, 2009, 28(5): Article No. 124.

  8. Chen T, Tan P, Ma L Q et al. PoseShop: Human image database construction and personalized content synthesis. IEEE Transactions on Visualization and Computer Graphics, 2013, 19(5): 824-837.

  9. Fang F, Yi M, Feng H et al. Narrative collage of image collections by scene graph recombination. IEEE Transactions on Visualization and Computer Graphics, 2018, 24(9): 2559-2572.

  10. Zitnick C L, Parikh D. Bringing semantics into focus using visual abstraction. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2013, pp.3009-3016.

  11. Zitnick C L, Parikh D, Vanderwende L. Learning the visual interpretation of sentences. In Proc. the IEEE International Conference on Computer Vision, December 2013, pp.1681-1688.

  12. Coyne B, Sproat R. WordsEye: An automatic text-to-scene conversion system. In Proc. the 28th Annual Conference on Computer Graphics and Interactive Techniques, August 2001, pp.487-496.

  13. Chang A, Savva M, Manning C D. Learning spatial knowledge for text to 3D scene generation. In Proc. the 2014 Conference on Empirical Methods in Natural Language Processing, October 2014, pp.2028-2038.

  14. Reed S, van den Oord A, Kalchbrenner N et al. Generating interpretable images with controllable structure. In Proc. the International Conference on Learning Representations, April 2017.

  15. Goodfellow I, Pouget-Abadie J, Mirza M et al. Generative adversarial nets. In Proc. the Annual Conference on Neural Information Processing Systems, December 2014, pp.2672-2680.

  16. Reed S E, Akata Z, Mohan S et al. Learning what and where to draw. In Proc. the Annual Conference on Neural Information Processing Systems, December 2016, pp.217-225.

  17. Zhang H, Xu T, Li H et al. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(8): 1947-1962.

  18. Xu T, Zhang P, Huang Q et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2018, pp.1316-1324.

  19. Yin G, Liu B, Sheng L et al. Semantics disentangling for text-to-image generation. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.2327-2336.

  20. Zhou X, Huang S, Li B et al. Text guided person image synthesis. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.3663-3672.

  21. Tan H, Liu X, Li X et al. Semantics-enhanced adversarial nets for text-to-image synthesis. In Proc. the IEEE International Conference on Computer Vision, October 2019, pp.10500-10509.

  22. Qiao T, Zhang J, Xu D et al. MirrorGAN: Learning text-to-image generation by redescription. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.1505-1514.

  23. Johnson J, Gupta A, Li F F. Image generation from scene graphs. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2018, pp.1219-1228.

  24. Li W, Zhang P, Zhang L et al. Object-driven text-to-image synthesis via adversarial training. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.12174-12182.

  25. Hinz T, Heinrich S, Wermter S. Generating multiple objects at spatially distinct locations. arXiv:1901.00686, 2019. https://arxiv.org/abs/1901.00686, October 2019.

  26. Xu K, Ba J, Kiros R et al. Show, attend and tell: Neural image caption generation with visual attention. In Proc. the 32nd International Conference on Machine Learning, July 2015, pp.2048-2057.

  27. Karpathy A, Li F F. Deep visual-semantic alignments for generating image descriptions. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp.3128-3137.

  28. Johnson J, Karpathy A, Li F F. DenseCap: Fully convolutional localization networks for dense captioning. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp.4565-4574.

  29. Krause J, Johnson J, Krishna R et al. A hierarchical approach for generating descriptive image paragraphs. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.3337-3345.

  30. Yao L, Torabi A, Cho K et al. Describing videos by exploiting temporal structure. In Proc. the IEEE International Conference on Computer Vision, December 2015, pp.4507-4515.

  31. Yu H, Wang J, Huang Z et al. Video paragraph captioning using hierarchical recurrent neural networks. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp.4584-4593.

  32. Li A, Sun J, Ng J Y H et al. Generating holistic 3D scene abstractions for text-based image retrieval. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.1942-1950.

  33. Fellbaum C. WordNet. In Theory and Applications of Ontology: Computer Applications, Poli R, Healy M, Kameas A (eds.), Springer Netherlands, 2010, pp.231-243.

  34. He K, Gkioxari G, Dollár P et al. Mask R-CNN. In Proc. the IEEE International Conference on Computer Vision, October 2017, pp.2980-2988.

  35. Laina I, Rupprecht C, Belagiannis V et al. Deeper depth prediction with fully convolutional residual networks. In Proc. the 4th International Conference on 3D Vision, October 2016, pp.239-248.

  36. Yeh Y T, Yang L, Watson M et al. Synthesizing open worlds with constraints using locally annealed reversible jump MCMC. ACM Transactions on Graphics, 2012, 31(4): Article No. 56.

  37. Pérez P, Gangnet M, Blake A. Poisson image editing. ACM Transactions on Graphics, 2003, 22(3): 313-318.

  38. Liao Z, Karsch K, Forsyth D. An approximate shading model for object relighting. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp.5307-5314.

  39. Elder J H. Shape from contour: Computation and representation. Annual Review of Vision Science, 2018, 4(1): 423-450.

  40. Johnston S F. Lumo: Illumination for cel animation. In Proc. the 2nd International Symposium on Non-Photorealistic Animation and Rendering, June 2002, pp.45-52.

  41. Wu T P, Sun J, Tang C K et al. Interactive normal reconstruction from a single image. ACM Transactions on Graphics, 2008, 27(5): Article No. 119.

  42. Grosse R, Johnson M K, Adelson E H et al. Ground truth dataset and baseline evaluations for intrinsic image algorithms. In Proc. the 12th IEEE International Conference on Computer Vision, September 2009, pp.2335-2342.

  43. Karsch K, Sunkavalli K, Hadap S et al. Automatic scene inference for 3D object compositing. ACM Transactions on Graphics, 2014, 33(3): Article No. 32.

  44. Godard C, Mac Aodha O, Brostow G J. Unsupervised monocular depth estimation with left-right consistency. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.6602-6611.

Author information

Correspondence to Chun-Xia Xiao.

Electronic supplementary material

ESM 1 (PDF 884 kb)

Cite this article

Fang, F., Luo, F., Zhang, HP. et al. A Comprehensive Pipeline for Complex Text-to-Image Synthesis. J. Comput. Sci. Technol. 35, 522–537 (2020). https://doi.org/10.1007/s11390-020-0305-9