Training bidirectional generative adversarial networks with hints

https://doi.org/10.1016/j.patcog.2020.107320

Highlights

  • The BiGAN has an encoder, in addition to the generator and discriminator of the GAN.

  • This encoder coupled with the generator allows defining extra loss terms as hints.

  • We experiment on five image data sets, MNIST, UT-Zap50K, GTSRB, Cifar10, and CelebA.

  • With these different hints, BiGAN generates higher quality and more diverse images.

Abstract

The generative adversarial network (GAN) is composed of a generator and a discriminator where the generator is trained to transform random latent vectors to valid samples from a distribution and the discriminator is trained to separate such “fake” examples from true examples of the distribution, which in turn forces the generator to generate better fakes. The bidirectional GAN (BiGAN) also has an encoder working in the inverse direction of the generator to produce the latent space vector for a given example. This added encoder allows defining auxiliary reconstruction losses as hints for a better generator. On five widely-used data sets, we showed that BiGANs trained with the Wasserstein loss and augmented with hints learn better generators in terms of image generation quality and diversity, as measured numerically by the 1-nearest neighbor test, Fréchet inception distance, and reconstruction error, and qualitatively by visually analyzing the generated samples.

Introduction

In generative modeling, we have a data set {x^t}_t drawn from an unknown probability distribution p(x) and we would like to be able to generate new x that also look like they have been drawn from p(x). For example, the x^t may be the face images of a collection of people and we would like to be able to generate new face images; these new synthetic images would be legitimate faces but of people who do not exist.

The typical approach would be to learn some estimator of p(x) (e.g., using a Gaussian distribution) and then to sample from that estimator. The approach we take in this paper defines generative modeling as a mapping task where a generator function takes some low-dimensional z drawn from a given p(z) as input and transforms it into a valid instance x from p(x). All the structure that p(x) has (for example, all the requirements of being a face image) needs to be captured during learning so that the newly generated x also reflect that structure.

In many real-world applications that involve, for instance, images, speech, or text, our observations x are high-dimensional; at the same time, we know that not all of these dimensions are necessary or independent. An important research area in machine learning is hence dimensionality reduction, where we want to map x to a much lower-dimensional z-space with minimal loss of information, and many methods, e.g., principal components analysis (PCA), have been proposed to learn such a mapping. In a generative model, we posit that the dimensions of z are latent factors that interact to generate the observed x; one example model is factor analysis (FA), which goes in the opposite direction of PCA.

Unsupervised dimensionality reduction can be learned using the neural network architecture called the autoencoder (AE) (Fig. 1). The encoder part compresses x to z (as in PCA) and the decoder part generates x from z (as in FA). The two networks, back to back, are trained to reconstruct the input, that is, to minimize the difference between the output of the decoder and the input to the encoder. In the simplest case, both the encoder and the decoder are one-layer (i.e., linear) networks, and it has been shown that the encoder then spans the same subspace as PCA; with the encoder and the decoder having more layers, the AE realizes nonlinear dimensionality reduction, with z corresponding to more interesting abstract features of the input.

Typically, the encoder and the decoder are taken to be inverses of each other in terms of network architecture. For example, with image data, the encoder starts with one or more convolution layers that successively downsample, followed by one or more dense layers decreasing dimensionality at each layer; the decoder inverts this path, starting with one or more dense layers that increase dimensionality at each layer and ending with one or more upsampling deconvolution layers that generate the image back again.
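To make the mirrored design concrete, here is a minimal PyTorch sketch of such a convolutional autoencoder. The layer sizes (32 × 32 RGB input, a 64-dimensional z) are illustrative assumptions, not the architectures used in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAutoencoder(nn.Module):
    """Mirrored encoder/decoder; sizes here are illustrative, not the paper's."""
    def __init__(self, z_dim=64):
        super().__init__()
        # Encoder: convolutions downsample 32x32x3 -> 8x8x64, then a dense layer maps to z.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 16x16 -> 8x8
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, z_dim),
        )
        # Decoder: the inverse path, a dense layer then transposed convolutions upsample back.
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, 64 * 8 * 8),
            nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),   # 16x16 -> 32x32
            nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

# Training minimizes the reconstruction error between input and output.
model = ConvAutoencoder()
x = torch.rand(16, 3, 32, 32)                # a dummy mini-batch
loss = F.mse_loss(model(x), x)
```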

The autoencoder is not a generative model; for any x, we can find the corresponding z and then reconstruct x, but we have no way of generating new x outside of the training set. In the variational autoencoder (VAE) [1], we consider the z^t as random variables sampled from a known distribution p(z) (e.g., Gaussian), and we add an extra term to the reconstruction error to enforce this. Once training is done, we can sample from this p(z) and use the decoder to generate new x.
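For reference, the criterion of [1] combines exactly these two terms; this is the standard textbook form of the objective, not a derivation specific to this paper:

```latex
% Evidence lower bound maximized by the VAE (Kingma and Welling [1]):
% a reconstruction term plus a KL term pulling q(z|x) toward the prior p(z).
\mathcal{L}_{\text{VAE}}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  - D_{\text{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
```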

In this paper, we extend the generative adversarial network (GAN) [2] that has recently been shown to work better than the VAE as a generative model. The original GAN model is composed of two networks, a generator G and a discriminator D (Fig. 2). Both G and D are deep neural networks with convolutional and dense layers as appropriate. The generator takes a latent vector z as input and generates an observation vector x, where z are low-dimensional and are sampled from an assumed probability distribution p(z) (e.g., multivariate Gaussian with independent features). Once training is done, we can generate new x by sampling new z from p(z) and passing them through G.

The samples generated by G are called fakes; they are the adversarial counterparts to the true x^t that we have in our training set. The aim of the discriminator is to tell the true and fake samples apart as well as possible, and that is how it is trained. The aim of the generator on the other hand is to generate fakes so well that the discriminator cannot tell them apart from the true samples. The two networks G and D play an adversarial game and gradually improve their abilities: As G gets to generate better fakes, D gets better at detecting them, which in turn forces G to get even better, and so on.

The following log-likelihood criterion is maximized by D and minimized by G:

$$\mathcal{L}_{\text{GAN}} = \sum_{x^t \in X} \log D(x^t) + \sum_{z^t \sim p(z)} \log\left(1 - D(G(z^t))\right)$$

Here, the x^t are the true samples drawn from the training set X and the G(z^t) are the fake samples with z^t sampled from p(z).
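In code, this criterion translates into alternating updates of D and G. The following PyTorch sketch uses placeholder one-layer networks and illustrative hyperparameters as assumptions; it also uses the common non-saturating generator loss rather than literally minimizing log(1 − D(G(z))):

```python
import torch
import torch.nn.functional as F

# Placeholder networks; any discriminator/generator pair with these input and
# output shapes would do. z_dim and the architectures are assumptions.
z_dim = 64
net_G = torch.nn.Sequential(torch.nn.Linear(z_dim, 784), torch.nn.Tanh())
net_D = torch.nn.Sequential(torch.nn.Linear(784, 1))    # outputs a logit

opt_D = torch.optim.Adam(net_D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(net_G.parameters(), lr=2e-4)

def train_step(x_real):
    batch = x_real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # D step: maximize log D(x) + log(1 - D(G(z))), i.e., minimize the
    # binary cross-entropy with targets real=1, fake=0.
    z = torch.randn(batch, z_dim)
    x_fake = net_G(z).detach()               # do not backprop into G here
    loss_D = F.binary_cross_entropy_with_logits(net_D(x_real), ones) + \
             F.binary_cross_entropy_with_logits(net_D(x_fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # G step: fool D; the "non-saturating" variant maximizes log D(G(z))
    # instead of minimizing log(1 - D(G(z))), which gives stronger gradients.
    z = torch.randn(batch, z_dim)
    loss_G = F.binary_cross_entropy_with_logits(net_D(net_G(z)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

train_step(torch.rand(16, 784) * 2 - 1)      # dummy batch in [-1, 1]
```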

Since its inception, the GAN model and its many variants have been successfully used in many applications, in image, video, text and music generation; see [3] for a survey.

Despite their various successful applications, it has been seen that training GANs is difficult and several empirical tips and tricks have been proposed to improve convergence, such as label smoothing, mini-batch discrimination, and feature matching [4].

Our approach in this paper involves adding an auxiliary loss term to that of the GAN and optimizing the resulting augmented criterion in training. This added term provides a "hint" that directs the learning process towards a better generator. We propose a general framework that defines how such hints can be included in training and show four variants. To be able to define such hints, we use the bidirectional form of the GAN. The original GAN can generate x for any z but does not have an inverse mapper for generating the corresponding z for a given x. The bidirectional GAN (BiGAN) [5], [6] also includes an encoder component, and this encoder allows us to define various loss functions to train better generators. This new encoder component (x → z), which is also implemented as a deep neural network, works just like the encoder of the AE, and the generator of the GAN (z → x) works just like the decoder of the AE; we use this correspondence in defining the auxiliary loss functions.
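As a concrete illustration, the sketch below shows one plausible hint of this kind: a reconstruction error in the original image space, in the spirit of the data-space (DS) variant mentioned in the conclusions. The encoder/generator stand-ins, the squared-error norm, and the weighting scheme are assumptions here, not the paper's exact definitions:

```python
import torch
import torch.nn.functional as F

# Hypothetical encoder/generator placeholders; in a BiGAN, E maps x -> z and
# G maps z -> x, so G(E(x)) is a reconstruction of x.
z_dim, x_dim = 64, 784
E = torch.nn.Linear(x_dim, z_dim)
G = torch.nn.Linear(z_dim, x_dim)

def hint_loss_data_space(x_real):
    """Image-space reconstruction hint: penalize the error between x and G(E(x)).

    A sketch in the spirit of the paper's data-space (DS) variant; the exact
    norm and weighting used in the paper are not shown in this excerpt.
    """
    return F.mse_loss(G(E(x_real)), x_real)

# The hint enters training as an auxiliary term added to the adversarial loss:
#   loss_total = loss_adversarial + lambda_hint * hint_loss_data_space(x)
# where lambda_hint is a weighting hyperparameter (an assumption here).
```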

In addition to training a better generator, having an encoder can also be useful in other scenarios: Once we have such a mechanism, by investigating how z changes as x is changed, we can assign meaning to the different dimensions of z [7], which allows knowledge extraction: for example, we can do "vector algebra" where an abstract feature such as putting on glasses corresponds to adding a vector in the z-space. Because such an encoder works as a dimensionality reducer trained with unlabeled data, it can also be used as a preprocessor before a later classifier or regressor in a semi-supervised setting.
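The "vector algebra" idea can be made concrete in a few lines; the encoder E, generator G, and image tensors below are hypothetical stand-ins for a trained BiGAN:

```python
import torch

# Placeholder encoder/generator; in practice these would be a trained BiGAN pair.
z_dim, x_dim = 64, 784
E = torch.nn.Linear(x_dim, z_dim)
G = torch.nn.Linear(z_dim, x_dim)

# The z-space direction for an attribute such as glasses can be estimated as
# the difference between the mean codes of faces with and without it.
faces_with_glasses = torch.rand(100, x_dim)      # dummy stand-ins for real images
faces_without_glasses = torch.rand(100, x_dim)
v_glasses = E(faces_with_glasses).mean(0) - E(faces_without_glasses).mean(0)

# Adding that direction to a new face's code and decoding "puts on glasses".
x_new = torch.rand(1, x_dim)
x_with_glasses = G(E(x_new) + v_glasses)
```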

In Section 2.1, we discuss the BiGAN model trained using the original log-likelihood criterion as well as with the Wasserstein loss, introducing the Wasserstein BiGAN. We introduce the auxiliary reconstruction criteria for training BiGANs in Section 3. Our experimental results on five image data sets are given in Section 4 and we conclude in Section 5.

Section snippets

The bidirectional GAN

The original GAN can generate x for any z but does not have an inverse mapper for generating the corresponding z for any given x. The Bidirectional GAN (BiGAN) [5] and the equivalent Adversarially Learned Inference (ALI) [6] models were proposed independently and also contain an encoder component E mapping true x to z (Fig. 3). Unlike the GAN, where the discriminator sees only x as input, in the BiGAN, D sees both x and z, i.e., the observation and its latent representation together. For a true
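A minimal sketch of such a joint discriminator is given below; concatenating a flattened x with z is one common fusion choice and is an assumption here, not necessarily the paper's architecture:

```python
import torch
import torch.nn as nn

class JointDiscriminator(nn.Module):
    """BiGAN discriminator scoring (x, z) pairs rather than x alone.

    Concatenating a flattened x with z is one simple fusion; the paper's
    exact architecture is not shown in this excerpt.
    """
    def __init__(self, x_dim=784, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),   # one logit: real pair (x, E(x)) vs fake pair (G(z), z)
        )

    def forward(self, x, z):
        return self.net(torch.cat([x.flatten(1), z], dim=1))

D = JointDiscriminator()
score = D(torch.rand(8, 784), torch.randn(8, 64))
```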

Motivation

Training a GAN is difficult for a number of reasons:

  1. Though adversarial training casts it as a supervised problem, training a generator is in fact an unsupervised learning task, and unsupervised learning is known to be more difficult because there is less feedback.

  2. There are two models, D and G, to train, and hence the problem of model selection is doubled. Both are typically implemented by many-layered deep networks and the depth and width of both should be

Setting

We use five well-known real-world image data sets frequently used to test GANs: MNIST, UT-Zap50K shoes, the German Traffic Sign Recognition Benchmark (GTSRB), Cifar10, and CelebA. MNIST consists of 60,000 handwritten grayscale digit images, each of size 28 × 28, which we resize to 32 × 32 for convenience. The UT-Zap50K data set contains 50,025 shoe images in RGB; the images are of varying sizes and we resize them to 32 × 32. The training set of GTSRB contains 39,209 traffic sign images which we
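The resizing described here can be done with standard torchvision transforms; the following sketch for MNIST is illustrative, and the normalization choice is an assumption rather than the paper's preprocessing:

```python
import torchvision
import torchvision.transforms as T

# Resize everything to 32x32 as described above; the normalization to [-1, 1]
# is an illustrative assumption, not necessarily the paper's preprocessing.
transform = T.Compose([
    T.Resize((32, 32)),
    T.ToTensor(),
    T.Normalize(mean=[0.5], std=[0.5]),
])

mnist = torchvision.datasets.MNIST(root="data", train=True,
                                   download=True, transform=transform)
```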

Conclusions

We applied the Wasserstein loss to the BiGAN and also extended it with additional loss criteria. From our experiments on the MNIST, UT-Zap50K, GTSRB, Cifar10, and CelebA data sets, we have reached the following findings:

  • The autoencoder structure of BiGAN allows defining a reconstruction error which can be used to define different loss criteria as hints. We see that suitably defined hints lead to improved quality in generation.

  • We find that the variant that works in the original image space (DS) and

Acknowledgements

This work is partially supported by Boğaziçi University Research Funds with Grant Number 18A01P7. We also thank TETAM for the computing facilities provided.


References (32)

  • A. Atapour-Abarghouei et al., Generative adversarial framework for depth filling via Wasserstein metric, cosine transform and domain transfer, Pattern Recognit. (2019).
  • W. Xu et al., Toward learning a unified many-to-many mapping for diverse image translation, Pattern Recognit. (2019).
  • D.P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv:1312.6114.
  • I. Goodfellow et al., Generative adversarial nets, Advances in Neural Information Processing Systems 27 (2014).
  • Y. Hong, U. Hwang, J. Yoo, S. Yoon, How generative adversarial networks and their variants work: an overview.
  • M. Arjovsky et al., Towards principled methods for training generative adversarial networks, International Conference on Learning Representations (2017).
  • J. Donahue, P. Krähenbühl, T. Darrell, Adversarial feature learning, arXiv:1605.09782.
  • V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, A. Courville, Adversarially learned inference.
  • A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks.
  • M. Arjovsky et al., Wasserstein generative adversarial networks, International Conference on Machine Learning (2017).
  • T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive growing of GANs for improved quality, stability, and variation.
  • I. Gulrajani et al., Improved training of Wasserstein GANs, Advances in Neural Information Processing Systems 30 (2017).
  • G.-J. Qi, Loss-sensitive generative adversarial networks on Lipschitz densities, arXiv:1701.06264.
  • C. Szegedy et al., Rethinking the inception architecture for computer vision, IEEE Conference on Computer Vision and Pattern Recognition (2016).
  • A.B.L. Larsen, S.K. Sønderby, H. Larochelle, O. Winther, Autoencoding beyond pixels using a learned similarity metric.
  • A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, B. Frey, Adversarial autoencoders, arXiv:1511.05644.

Uras Mutlu received his B.Sc. degree in computer engineering from Istanbul Technical University in 2016 and his M.S. degree in computer engineering from Boǧaziçi University in 2019. He is currently a Ph.D. student with research interests in generative adversarial networks, computer vision, natural language processing, and deep learning in general.

Ethem Alpaydın received his Ph.D. degree from Ecole Polytechnique Fédérale de Lausanne, Switzerland, in 1990, and was a postdoc at the International Computer Science Institute, Berkeley, in 1991. He is a Professor in the Department of Computer Engineering, Boǧaziçi University, Istanbul, and a Member of the Science Academy, Istanbul. He was a Visiting Researcher at MIT in 1994, at IDIAP in 1998, and at TU Delft in 2014, and a Fulbright scholar in 1997. The third edition of his book Introduction to Machine Learning was published by The MIT Press in 2014.
