
Layout2image: Image Generation from Layout

International Journal of Computer Vision

Abstract

Despite significant recent progress on generative models, controlled generation of images depicting multiple and complex object layouts is still a difficult problem. Among the core challenges are the diversity of appearance a given object may possess and, as a result, the exponentially large set of images consistent with a specified layout. To address these challenges, we propose a novel approach to layout-based image generation, which we call Layout2Im. Given a coarse spatial layout (bounding boxes + object categories), our model generates a set of realistic images with the correct objects in the desired locations. The representation of each object is disentangled into a specified/certain part (category) and an unspecified/uncertain part (appearance). The category is encoded using a word embedding, and the appearance is distilled into a low-dimensional vector sampled from a normal distribution. Individual object representations are composed together using a convolutional LSTM to obtain an encoding of the complete layout, which is then decoded to an image. Several loss terms are introduced to encourage accurate and diverse image generation. The proposed Layout2Im model significantly outperforms the previous state of the art, boosting the best reported inception score by 24.66% and 28.57% on the very challenging COCO-Stuff and Visual Genome datasets, respectively. Extensive experiments also demonstrate our model's ability to generate complex and diverse images with many objects.
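To make the object representation described in the abstract concrete, the following is a minimal PyTorch-style sketch of a per-object latent code: a category word embedding (the specified part) concatenated with an appearance vector sampled from a normal distribution (the unspecified part). The module name, dimensions, and category count are illustrative assumptions, not the authors' released implementation.

# Minimal sketch (illustrative, not the authors' code) of the disentangled
# per-object representation: category embedding + appearance noise.
import torch
import torch.nn as nn

class ObjectCode(nn.Module):
    """Builds a per-object latent: category word embedding concatenated with appearance noise."""
    def __init__(self, num_categories, cat_dim=64, app_dim=64):
        super().__init__()
        self.category_embedding = nn.Embedding(num_categories, cat_dim)
        self.app_dim = app_dim

    def forward(self, category_ids):
        # Specified/certain part: the object category, as a learned word embedding.
        cat_code = self.category_embedding(category_ids)             # (num_objs, cat_dim)
        # Unspecified/uncertain part: appearance sampled from N(0, I), so the
        # same layout can be decoded into many different images.
        app_code = torch.randn(category_ids.size(0), self.app_dim)   # (num_objs, app_dim)
        return torch.cat([cat_code, app_code], dim=1)                # (num_objs, cat_dim + app_dim)

# Usage: three objects drawn from a hypothetical 100-category vocabulary.
codes = ObjectCode(num_categories=100)(torch.tensor([3, 17, 42]))
print(codes.shape)   # torch.Size([3, 128])

Sampling a fresh appearance vector for each object is what lets a single fixed layout map to a diverse set of generated images.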



Acknowledgements

This research was supported, in part, by NSERC Discovery, NSERC DAS and NSERC CFI grants. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research.

Author information


Corresponding author

Correspondence to Bo Zhao.

Additional information

Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Network Architecture

Here we describe the detailed network architecture of all model components in Tables 7, 8, 9, 10, 11 and 12; an illustrative code sketch of how the fuser and decoder compose is given after the table captions below. The following notation is used: CONV: convolutional layer; DECONV: transposed convolutional layer; FC: fully connected layer; CLSTM: convolutional LSTM; AVGPOOL: average pooling layer; CBN: conditional batch normalization; BN: batch normalization; ReLU: rectified linear unit; SUM: summation of feature maps along the H and W axes; N: number of output channels; K: kernel size; S: stride; P: padding.

Table 7 Architecture of object estimator
Table 8 Architecture of object encoder
Table 9 Architecture of objects fuser
Table 10 Architecture of image decoder
Table 11 Architecture of image discriminator
Table 12 Architecture of object discriminator. C is the number of object categories
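For intuition on how the components in Tables 9 and 10 fit together, the following is an illustrative PyTorch-style sketch of the objects-fuser and image-decoder stages: per-object feature maps are folded into a single layout encoding by a convolutional LSTM, which is then upsampled to an image by transposed convolutions. The ConvLSTMCell implementation, channel counts, and spatial sizes are assumptions for illustration, not the authors' exact architecture or hyper-parameters.

# Illustrative sketch of the CLSTM fuser and DECONV decoder stages
# (assumed shapes and layer sizes; not the authors' exact architecture).
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """A single convolutional LSTM cell with a 2D hidden state."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class ObjectsFuser(nn.Module):
    """Folds a sequence of per-object feature maps into one layout encoding."""
    def __init__(self, obj_ch, hid_ch):
        super().__init__()
        self.hid_ch = hid_ch
        self.cell = ConvLSTMCell(obj_ch, hid_ch)

    def forward(self, obj_maps):                     # obj_maps: (num_objs, C, H, W)
        _, _, H, W = obj_maps.shape
        h = obj_maps.new_zeros(1, self.hid_ch, H, W)
        c = obj_maps.new_zeros(1, self.hid_ch, H, W)
        for t in range(obj_maps.size(0)):            # iterate over the objects
            h, c = self.cell(obj_maps[t:t + 1], (h, c))
        return h                                     # fused layout encoding

# A toy decoder: two transposed convolutions (DECONV) up to an RGB image.
decoder = nn.Sequential(
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
)

obj_maps = torch.randn(5, 64, 16, 16)                # 5 per-object feature maps at 16x16
layout = ObjectsFuser(obj_ch=64, hid_ch=128)(obj_maps)
image = decoder(layout)                              # (1, 3, 64, 64)
print(image.shape)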

About this article

Cite this article

Zhao, B., Yin, W., Meng, L. et al. Layout2image: Image Generation from Layout. Int J Comput Vis 128, 2418–2435 (2020). https://doi.org/10.1007/s11263-020-01300-7

