Multimodal Image Synthesis with Conditional Implicit Maximum Likelihood Estimation

International Journal of Computer Vision

Abstract

Many tasks in computer vision and graphics fall within the framework of conditional image synthesis. In recent years, generative adversarial nets have delivered impressive advances in the quality of synthesized images. However, it remains challenging to generate images that are both diverse and plausible for the same input, due to the problem of mode collapse. In this paper, we develop a new generic multimodal conditional image synthesis method based on implicit maximum likelihood estimation and demonstrate improved multimodal image synthesis performance on two tasks: single image super-resolution and image synthesis from scene layouts. We make our implementation publicly available.
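
At a high level, conditional IMLE sidesteps mode collapse by, for each training input, drawing several latent noise vectors, synthesizing a candidate image from each, and updating the generator so that the candidate closest to the ground-truth image is pulled towards it. The following is a minimal sketch of one such training step in PyTorch; the generator G, the distance dist and all other names are illustrative placeholders rather than the released implementation (linked under Additional information below).

import torch

def cimle_step(G, dist, x, y, optimizer, num_samples=10, z_dim=128):
    """One conditional IMLE update for a single (input, target) pair."""
    with torch.no_grad():
        # Draw several candidate noise vectors and synthesize an image for each.
        z = torch.randn(num_samples, z_dim, device=x.device)
        candidates = [G(x, z[j]) for j in range(num_samples)]
        # Keep the candidate that lies closest to the ground-truth image.
        # dist is assumed to return a scalar tensor (e.g. an L2 or perceptual distance).
        dists = torch.stack([dist(c, y) for c in candidates])
        j_star = int(torch.argmin(dists))

    # Pull the selected sample towards the ground-truth image.
    optimizer.zero_grad()
    loss = dist(G(x, z[j_star]), y)
    loss.backward()
    optimizer.step()
    return loss.item()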

References

  • Almahairi, A., Rajeswar, S., Sordoni, A., Bachman, P., & Courville, A. (2018). Augmented cyclegan: Learning many-to-many mappings from unpaired data. In ICML.

  • Arjovsky, M., & Bottou, L. (2017). Towards principled methods for training generative adversarial networks. arXiv:1701.04862.

  • Arora, S., & Zhang, Y. (2017). Do GANs actually learn the distribution? An empirical study. arXiv:1706.08224.

  • Bruna, J., Sprechmann, P., & LeCun, Y. (2015). Super-resolution with deep convolutional sufficient statistics. arXiv:1511.05666.

  • Charpiat, G., Hofmann, M., & Schölkopf, B. (2008). Automatic image colorization via multimodal predictions. In European conference on computer vision (pp. 126–139). Springer.

  • Chen, Q., & Koltun, V. (2017). Photographic image synthesis with cascaded refinement networks. In IEEE international conference on computer vision (ICCV) (Vol. 1, p. 3).

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Dahl, R., Norouzi, M., & Shlens, J. (2017). Pixel recursive super resolution. In 2017 IEEE international conference on computer vision (ICCV) (pp. 5449–5458).

  • Denton, E. L., Chintala, S., Szlam, A., & Fergus, R. (2015). Deep generative image models using a laplacian pyramid of adversarial networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.) Advances in neural information processing systems 28 (pp. 1486–1494). Curran Associates, Inc. http://papers.nips.cc/paper/5773-deep-generative-image-models-using-a-laplacian-pyramid-of-adversarial-networks.pdf.

  • Donahue, J., Krähenbühl, P., & Darrell, T. (2016). Adversarial feature learning. arXiv:1605.09782.

  • Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., & Courville, A. (2016). Adversarially learned inference. arXiv:1606.00704.

  • Finn, C., Goodfellow, I., & Levine, S. (2016). Unsupervised learning for physical interaction through video prediction. In Advances in neural information processing systems (pp. 64–72).

  • Gauthier, J. (2014). Conditional generative adversarial nets for convolutional face generation. Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter Semester, 2014(5), 2.

  • Ghosh, A., Kulharia, V., Namboodiri, V., Torr, P. H., & Dokania, P. K. (2017). Multi-agent diverse generative adversarial networks. arXiv:1704.02906.

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.) Advances in neural information processing systems 27 (pp. 2672–2680). Curran Associates, Inc. http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.

  • Goodfellow, I. J. (2014). On distinguishability criteria for estimating generative models. arXiv:1412.6515.

  • Gutmann, M. U., Dutta, R., Kaski, S., & Corander, J. (2014). Likelihood-free inference via classification. arXiv:1407.4981.

  • Guzman-Rivera, A., Batra, D., & Kohli, P. (2012). Multiple choice learning: Learning to produce multiple structured outputs. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.) Advances in neural information processing systems 25 (pp. 1799–1807). Curran Associates, Inc. http://papers.nips.cc/paper/4549-multiple-choice-learning-learning-to-produce-multiple-structured-outputs.pdf.

  • Huang, X., Liu, M. Y., Belongie, S., & Kautz, J. (2018). Multimodal unsupervised image-to-image translation. arXiv:1804.04732.

  • Isola, P., Zhu, J.-Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In The IEEE conference on computer vision and pattern recognition (CVPR).

  • Johnson, J., Alahi, A., & Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision (pp. 694–711). Springer.

  • Kaneko, T., Hiramatsu, K., & Kashino, K. (2017). Generative attribute controller with conditional filtered generative adversarial networks. In 2017 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 7006–7015). IEEE.

  • Karacan, L., Akata, Z., Erdem, A., & Erdem, E. (2016). Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv:1612.00215.

  • Larsen, A. B. L., Sønderby, S. K., Larochelle, H., & Winther, O. (2015). Autoencoding beyond pixels using a learned similarity metric. arXiv:1512.09300.

  • Larsson, G., Maire, M., & Shakhnarovich, G. (2016). Learning representations for automatic colorization. In European conference on computer vision (pp. 577–593). Springer.

  • Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A. P., Tejani, A., Totz, J., Wang, Z., et al. (2017). Photo-realistic single image super-resolution using a generative adversarial network. In CVPR (Vol. 2, p. 4).

  • Lee, H. Y., Tseng, H. Y., Mao, Q., Huang, J. B., Lu, Y. D., Singh, M. K., et al. (2018). Drit++: Diverse image-to-image translation via disentangled representations. arXiv:1808.00948.

  • Lee, S., Ha, J., & Kim, G. (2019). Harmonizing maximum likelihood with gans for multimodal conditional generation. arXiv:1902.09225.

  • Li, C., & Wand, M. (2016). Precomputed real-time texture synthesis with markovian generative adversarial networks. In European conference on computer vision (pp. 702–716). Springer.

  • Li, K., & Malik, J. (2016). Fast k-nearest neighbour search via Dynamic Continuous Indexing. In International conference on machine learning (pp. 671–679).

  • Li, K., & Malik, J. (2017). Fast k-nearest neighbour search via Prioritized DCI. In International conference on machine learning (pp. 2081–2090).

  • Li, K., & Malik, J. (2018). Implicit maximum likelihood estimation. arXiv:1809.09087.

  • Ma, L., Jia, X., Georgoulis, S., Tuytelaars, T., & Gool, L. V. (2018). Exemplar guided unsupervised image-to-image translation with semantic consistency. In ICLR.

  • Mathieu, M., Couprie, C., & LeCun, Y. (2015). Deep multi-scale video prediction beyond mean square error. arXiv:1511.05440.

  • Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv:1411.1784.

  • Oh, J., Guo, X., Lee, H., Lewis, R. L., & Singh, S. (2015). Action-conditional video prediction using deep networks in atari games. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.) Advances in neural information processing systems 28 (pp. 2863–2871). Curran Associates, Inc. http://papers.nips.cc/paper/5859-action-conditional-video-prediction-using-deep-networks-in-atari-games.pdf.

  • Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., & Efros, A. A. (2016). Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2536–2544).

  • Reed, S. E., Akata, Z., Mohan, S., Tenka, S., Schiele, B., & Lee, H. (2016). Learning what and where to draw. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.). Advances in neural information processing systems 29 (pp. 217–225). Curran Associates, Inc. http://papers.nips.cc/paper/6111-learning-what-and-where-to-draw.pdf.

  • Richter, S. R., Vineet, V., Roth, S., & Koltun, V. (2016). Playing for data: Ground truth from computer games. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), European conference on computer vision (ECCV), LNCS (pp. 102–118). Berlin: Springer International Publishing.

  • Sangkloy, P., Lu, J., Fang, C., Yu, F., & Hays, J. (2017). Scribbler: Controlling deep image synthesis with sketch and color. In IEEE conference on computer vision and pattern recognition (CVPR) (Vol. 2).

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.

  • Srivastava, N., Mansimov, E., & Salakhutdinov, R. (2015). Unsupervised learning of video representations using lstms. In International conference on machine learning (pp. 843–852).

  • Vondrick, C., Pirsiavash, H., & Torralba, A. (2016). Generating videos with scene dynamics. In Advances in neural information processing systems (pp. 613–621).

  • Wang, T. C., Liu, M. Y., Zhu, J. Y., Tao, A., Kautz, J., & Catanzaro, B. (2017). High-resolution image synthesis and semantic manipulation with conditional gans. arXiv:1711.11585.

  • Wang, X., & Gupta, A. (2016). Generative image modeling using style and structure adversarial networks. In European conference on computer vision (pp. 318–335). Springer.

  • Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., et al. (2018). Esrgan: Enhanced super-resolution generative adversarial networks. arXiv:1809.00219.

  • Yang, D., Hong, S., Jang, Y., Zhao, T., & Lee, H. (2019). Diversity-sensitive conditional generative adversarial networks. arXiv:1901.09024.

  • Yoo, D., Kim, N., Park, S., Paek, A. S., & Kweon, I. S. (2016). Pixel-level domain transfer. In European conference on computer vision (pp. 517–532). Springer.

  • Yu, F., Xian, W., Chen, Y., Liu, F., Liao, M., Madhavan, V., & Darrell, T. (2018). Bdd100k: A diverse driving video database with scalable annotation tooling. arXiv:1805.04687.

  • Zhang, R., Isola, P., & Efros, A. A. (2016). Colorful image colorization. In European conference on computer vision (pp. 649–666). Springer.

  • Zhang, R., Isola, P., Efros, A. A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In The IEEE conference on computer vision and pattern recognition (CVPR).

  • Zhu, J., Zhang, R., Pathak, D., Darrell, T., Efros, A. A., Wang, O., et al. (2017a). Toward multimodal image-to-image translation. arXiv:1711.11586.

  • Zhu, J. Y., Krähenbühl, P., Shechtman, E., & Efros, A. A. (2016). Generative visual manipulation on the natural image manifold. In European conference on computer vision (pp. 597–613). Springer.

  • Zhu, J. -Y., Park, T., Isola, P., & Efros, A. A. (2017b). Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE international conference on computer vision (ICCV).

  • Zhu, J. Y., Zhang, R., Pathak, D., Darrell, T., Efros, A. A., Wang, O., & Shechtman, E. (2017c). Toward multimodal image-to-image translation. In Advances in neural information processing systems (pp. 465–476).

Acknowledgements

This work was supported by ONR MURI N00014-14-1-0671. Ke Li thanks the Natural Sciences and Engineering Research Council of Canada (NSERC) for fellowship support.

Author information

Corresponding author

Correspondence to Ke Li.

Additional information

Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Code for super-resolution is available at https://github.com/niopeng/SRIM-pytorch and code for image synthesis from scene layout is available at https://github.com/zth667/Diverse-Image-Synthesis-from-Semantic-Layout.

Appendices

Appendix A: Baselines with Proposed Rebalancing Scheme

A natural question is whether applying the proposed rebalancing scheme to the baselines would significantly improve the diversity of the generated images. We tried this and found that diversity is still lacking; the results are shown in Fig. 15. The LPIPS score of CRN improves only slightly, from 0.12 to 0.13, after dataset and loss rebalancing are applied, and it still underperforms our method, which achieves an LPIPS score of 0.19. The LPIPS score of Pix2pix-HD shows no improvement after dataset rebalancing is applied; the model still ignores the latent input noise vector. This suggests that the baselines are not able to take advantage of the rebalancing scheme, whereas our method is, demonstrating its superior capability.
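
As a point of reference, a diversity score of this kind can be computed as the average pairwise LPIPS distance (Zhang et al. 2018) between samples generated for the same input. The sketch below illustrates such a computation with the publicly available lpips package; the exact evaluation protocol (number of samples, network backbone) is not reproduced here.

import itertools
import torch
import lpips  # pip install lpips

def diversity_score(samples, net="alex"):
    """samples: tensor of shape (N, 3, H, W) with values scaled to [-1, 1]."""
    metric = lpips.LPIPS(net=net)
    with torch.no_grad():
        # Average the perceptual distance over all unordered pairs of samples.
        dists = [
            metric(samples[i:i + 1], samples[j:j + 1]).item()
            for i, j in itertools.combinations(range(samples.shape[0]), 2)
        ]
    return sum(dists) / len(dists)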

Fig. 16: Frames of video generated by smoothly interpolating between different latent noise vectors.

Fig. 17: Frames from two videos of a moving car generated using our method. In both videos, we feed in scene layouts with cars of varying sizes to our model to generate different frames. In (a), we use the same latent noise vector across all frames. In (b), we interpolate between two latent noise vectors, one of which corresponds to a daytime scene and the other to a night time scene. The consistency of style across frames demonstrates that the learned space of latent noise vectors is semantically meaningful and that scene layout and style are successfully disentangled by our model.

Fig. 18: Comparison of the difference between adjacent frames of the synthesized moving-car video. Darker pixels indicate smaller differences and lighter pixels indicate larger differences. (a) Results for the video generated by our model. (b) Results for the video generated by pix2pix (Isola et al. 2017).

Appendix B: Additional Results

All videos that we refer to below are available at http://people.eecs.berkeley.edu/~ke.li/projects/imle/scene_layouts.

B.1: Video of Interpolations

We generated a video that shows smooth transitions between different renderings of the same scene. Frames of the generated video are shown in Fig. 16.
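
Such transitions can be rendered by interpolating in the space of latent noise vectors and decoding each intermediate vector with the trained generator. The sketch below uses simple linear interpolation between two noise vectors, assuming a generator G(layout, z); the names are illustrative and no particular interpolation scheme is prescribed here.

import torch

def interpolate_frames(G, layout, z_start, z_end, num_frames=30):
    """Render a smooth transition between two latent noise vectors."""
    frames = []
    for t in torch.linspace(0.0, 1.0, num_frames):
        z = (1.0 - t) * z_start + t * z_end  # blended noise vector
        with torch.no_grad():
            frames.append(G(layout, z))
    return frames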

B.2: Video Generation from Evolving Scene Layouts

We generated videos of a car moving farther away from the camera and then back towards the camera by generating individual frames independently using our model with different semantic segmentation maps as input. For the video to have consistent appearance, we must be able to consistently select the same mode across all frames. In Fig. 17, we show that our model has this capability: we are able to select a mode consistently by using the same latent noise vector across all frames.

Here we demonstrate one potential benefit of modelling multiple modes instead of a single mode. We tried generating a video from the same sequence of scene layouts using pix2pix (Isola et al. 2017), which only models a single mode. (For pix2pix, we used a pretrained model trained on Cityscapes, which is easier for the purposes of generating consistent frames because Cityscapes is less diverse than GTA-5.) In Fig. 18, we show the difference between adjacent frames in the videos generated by our model and pix2pix. As shown, our model is able to generate consistent appearance across frames (as evidenced by the small difference between adjacent frames). On the other hand, pix2pix is not able to generate consistent appearance across frames, because it arbitrarily picks a mode to generate and does not permit control over which mode it generates.
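
The frame differences visualized in Fig. 18 amount to per-pixel differences between consecutive frames; the sketch below reduces each difference map to a mean absolute difference per adjacent pair. It assumes frames are image tensors with values in [0, 1] and is meant as an illustration rather than the exact evaluation code.

def adjacent_frame_differences(frames):
    """frames: list of (3, H, W) image tensors with values in [0, 1]."""
    diffs = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        diff = (curr - prev).abs()        # per-pixel difference map (visualized in Fig. 18)
        diffs.append(diff.mean().item())  # scalar summary for this pair of frames
    return diffs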

Appendix C: More Generated Samples

More samples generated by the proposed method are shown in Fig. 19.

Fig. 19: Samples generated by our model. The image at the top-left corner is the input semantic layout and the other 19 images are samples generated by our model conditioned on the same semantic layout.


Cite this article

Li, K., Peng, S., Zhang, T. et al. Multimodal Image Synthesis with Conditional Implicit Maximum Likelihood Estimation. Int J Comput Vis 128, 2607–2628 (2020). https://doi.org/10.1007/s11263-020-01325-y
