1 Introduction

Generative adversarial networks (GANs) [5] are hailed as one of the most significant developments in machine learning research of the past decade. Since their introduction, GANs have been applied to a wide range of problems, and numerous papers have been published. In a nutshell, GANs are constructed around two functions [4, 11]: the generator G, which maps a sample z to the data distribution, and the discriminator D, which is trained to distinguish real samples of a dataset from fake samples produced by the generator. With the goal of reducing the difference between the distributions of fake and real samples, a GAN training algorithm trains G and D in tandem.

A major challenge of GANs is that controlling the performance of the discriminator is particularly difficult. Kullback–Leibler (KL) divergence was originally used as the loss function of the discriminator to determine the difference between the model and target distributions [16]. However, KL divergence is potentially noncontinuous with respect to the parameters of G, which makes training difficult [2, 23]. Specifically, when the support of the model distribution and the support of the target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target distribution. Once such a discriminator is found, zero gradients are backpropagated to G, and the training of G comes to a complete stop before reaching the optimal result. This phenomenon is referred to as the vanishing gradient problem.

The conventional form of the Lipschitz constraint is given by: \(||f(x_1)-f(x_2)||\le k\cdot ||x_1-x_2||\). The Lipschitz constraint requires the constrained function to be continuous and guarantees that its gradient norm is bounded. Besides, it has been found that enforcing the Lipschitz constraint can provide provable robustness against adversarial examples [21], improve generalization bounds [19], enable Wasserstein distance estimation [6], and also alleviate the training difficulty in GANs. Thus, a number of works have advocated the Lipschitz constraint. To be specific, weight clipping was first introduced to enforce the Lipschitz constraint [2]. However, it has been found that weight clipping may lead to the capacity underuse problem, where training favors a discriminator that uses only a few features [6]. To overcome the weakness of weight clipping, regularization terms such as the gradient penalty are added to the loss function to enforce the Lipschitz constraint on D [6, 12, 15]. More recently, Miyato et al. [13] introduced spectral normalization to control the Lipschitz constant of D by normalizing the weight matrix of each layer, which is regarded as an improvement on orthonormal regularization [18]. Using gradient penalty or spectral normalization can stabilize training and improve performance. However, it has been found that gradient penalty cannot regularize the function at points outside the support of the current generative distribution [13]. In addition, spectral normalization has been found to suffer from gradient norm attenuation [1, 10], i.e., a layer with a Lipschitz bound of 1 can reduce the norm of the gradient during backpropagation, and each step of backpropagation gradually attenuates the gradient norm, resulting in a much smaller Jacobian for the network's function than is theoretically allowed. Also, as we will show in Sects. 3 and 4.3, these new methods have the capacity underuse problem (see Proposition 1 and Fig. 1). Therefore, despite recent progress, it remains challenging to obtain a method that both works well in practice and provably satisfies a Lipschitz constraint.

In this paper, we introduce the boundedness and continuity (BC) conditions to enforce the Lipschitz constraint and introduce a CNN-based implementation of GANs with discriminators satisfying the BC conditions. We make the following contributions:

  (a) We prove that SN-GANs, one of the latest GAN training algorithms that use spectral normalization, prevent the discriminator functions from obtaining the optimal solution when Wasserstein distance is applied as the loss metric, even though the Lipschitz constraint is satisfied.

  (b) We present BC conditions to enforce the Lipschitz constraint for the GANs' discriminator functions and introduce a CNN-based implementation of GANs by enforcing the BC conditions (BC-GANs). We show that the performance of BC-GANs is competitive with state-of-the-art algorithms such as SN-GAN and WGAN-GP while having lower computational complexity.

2 Related work

2.1 Generative adversarial networks (GANs)

Generative adversarial networks (GANs) are a class of generative models that learn a generator G to capture the data distribution via an adversarial process. Specifically, a discriminator D is introduced to distinguish the generated images from the real ones, while the generator G is updated to confuse the discriminator. The adversarial process is formulated as a minimax game:

$$\begin{aligned} \underset{G}{\mathrm {min}} \ \underset{D}{\mathrm {max}} V(G, D) \end{aligned}$$
(1)

where the min and max are taken over the sets of generator and discriminator functions, respectively. V(G, D) evaluates the difference between the two distributions \(q_x\) and \(q_g\), where \(q_x\) is the data distribution and \(q_g\) is the generated distribution. The conventional form of V(G, D) is given by Kullback–Leibler (KL) divergence: \(E_{x \sim q_{x}}[\mathrm {log}D(x)]+E_{x'\sim q_{g}}[\mathrm {log}(1-D(x'))]\) [16].
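As an illustration of the conventional value function above, the following minimal sketch estimates \(E[\mathrm{log}D(x)]+E[\mathrm{log}(1-D(x'))]\) from two batches of discriminator outputs. The function name `gan_value` and the sample values are hypothetical and chosen only for this example; the outputs of D are assumed to lie strictly between 0 and 1.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(G, D) = E[log D(x)] + E[log(1 - D(x'))].

    d_real: outputs D(x) on real samples, each strictly inside (0, 1).
    d_fake: outputs D(x') on generated samples, each strictly inside (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot separate the batches outputs 0.5 everywhere.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident discriminator pushes the value toward 0 from below.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

For the undecided discriminator the estimate equals \(-2\,\mathrm{log}\,2\); the value grows toward 0 as D separates the two batches more confidently, which is what the inner maximization over D rewards.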

2.2 Methods to enforce Lipschitz constraint

Applying KL divergence as the implementation of V(G, D) can make training difficult, e.g., through the vanishing gradient problem. Thus, numerous methods have been introduced to solve this problem by enforcing the Lipschitz constraint, including weight clipping [2], gradient penalty [4] and spectral normalization [13].

Weight clipping was introduced by Wasserstein GAN (WGAN) [2], which used Wasserstein distance to measure the differences between real and fake distributions instead of KL divergence.

$$\begin{aligned} W(P_r, P_g) = \mathop {\sup }\limits _{f \in Lip1} \mathop {E}\limits _{x \sim P_r} [f(x)] - \mathop {E}\limits _{x \sim P_g} [f(x)] \end{aligned}$$
(2)

where \(W(P_r, P_g)\) represents the Wasserstein distance, \(P_r\) and \(P_g\) are the real and fake distributions, respectively. Weight clipping enforces the Lipschitz constraint by truncating each element of the weight matrices. Wasserstein distance shows superiority over KL divergence, because it can effectively avoid the vanishing gradient problem brought by KL divergence. In contrast to weight clipping, gradient penalty [6] penalizes the gradient at sample points to enforce Lipschitz constraint:

$$\begin{aligned} {L_D} = E[f(G(z))] - E[f(x)]+\underbrace{\alpha E[{{(||\nabla f(x)|| - 1)}^2}]}_{{\mathrm{gradient \ penalty}}} \end{aligned}$$
(3)

where \(L_D\) is the loss objective for the discriminator, and \(\alpha\) is a hyperparameter.
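To make the structure of Eq. (3) concrete, here is a minimal sketch for a scalar critic on one-dimensional inputs, with the gradient norm approximated by a central finite difference. The helper names, the default `alpha=10.0`, and the choice of `penalty_points` are assumptions made for this example; the text above only states that the penalty is evaluated at sample points.

```python
def grad_norm_1d(f, x, eps=1e-5):
    """Approximate |df/dx| at x with a central finite difference."""
    return abs((f(x + eps) - f(x - eps)) / (2 * eps))


def mean(values):
    return sum(values) / len(values)


def gradient_penalty_loss(f, real, fake, penalty_points, alpha=10.0):
    """Eq. (3): E[f(G(z))] - E[f(x)] + alpha * E[(|grad f| - 1)^2]."""
    critic_term = mean([f(x) for x in fake]) - mean([f(x) for x in real])
    penalty = mean([(grad_norm_1d(f, x) - 1.0) ** 2 for x in penalty_points])
    return critic_term + alpha * penalty


real = [1.0, 2.0]
fake = [0.0, 1.0]
points = [0.5, 1.5]  # hypothetical locations at which the gradient is penalized

# A critic with slope 1 pays no penalty; a critic with slope 3 is penalized.
loss_unit_slope = gradient_penalty_loss(lambda x: x, real, fake, points)
loss_steep_slope = gradient_penalty_loss(lambda x: 3.0 * x, real, fake, points)
```

The steeper critic separates the two batches more strongly, but the penalty term dominates its loss, which is how the regularizer pushes the critic toward unit gradient norm.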

Spectral normalization is a weight normalization method, which controls the Lipschitz constraint of the discriminator function by literally constraining the spectral norm of each layer. The implementation of the spectral normalization can be expressed as:

$$\begin{aligned} W_{SN}(W): = W/\sigma (W) \end{aligned}$$
(4)

where W represents the weight matrix in each network layer, \(\sigma (W)\) is the spectral norm of matrix W, which equals to the largest singular value of the matrix W, and \(W_{SN}(W)\) represents the normalized weight matrix. To a certain extent, spectral normalization has succeeded in facilitating stable training and improving performance.
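The normalization in Eq. (4) can be sketched with a plain power iteration that estimates the largest singular value. This is a minimal illustration in pure Python, not the implementation used in [13]; the helper names and the fixed iteration count are assumptions, and W is assumed to be a nonzero matrix given as a list of rows.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def spectral_norm(W, n_iter=100):
    """Estimate sigma(W), the largest singular value, by power iteration on W^T W."""
    Wt = transpose(W)
    v = [1.0] * len(W[0])  # the start vector must not be orthogonal to the top singular vector
    for _ in range(n_iter):
        v = matvec(Wt, matvec(W, v))
        scale = norm(v)
        v = [x / scale for x in v]
    return norm(matvec(W, v))


def spectral_normalize(W, n_iter=100):
    """Eq. (4): W_SN(W) = W / sigma(W)."""
    sigma = spectral_norm(W, n_iter)
    return [[w / sigma for w in row] for row in W]


W = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(W)
W_sn = spectral_normalize(W)
```

After normalization the largest singular value of the matrix is 1, so the corresponding linear layer is 1-Lipschitz with respect to the Euclidean norm.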

3 Existing problems

Although heuristic methods have been proposed to enforce Lipschitz constraint, it is still difficult to achieve a solution that is both practically effective and theoretically provably satisfying the Lipschitz constraint. To be specific, weight clipping was proven to be unsatisfactory in [4], and it can lead to the capacity underuse problem where training favors a discriminator that uses only a few features [6]. In addition, gradient penalty suffers from the obvious problem of not being able to regularize the function at the points outside of the support of the current generative distribution. In fact, the generative distribution and its support gradually change in the course of the training, and this can destabilize the effect of the regularization itself [13]. Moreover, it has been found that spectral normalization suffers from the gradient norm attenuation problem [1, 10]. Furthermore, we have found that applying spectral normalization prevents the discriminator functions from obtaining the optimal solutions when using Wasserstein distance as the loss metric. To provide an explanation to this problem, we present Proposition 1.

Let \(P_r\) and \(P_g\) be the distributions of real images and generated images in X, a compact metric space. The discriminator function f is constructed based on a neural network of the following form with input x:

$$\begin{aligned} f(x,\theta )=W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1x)\cdots )))) \end{aligned}$$
(5)

where \(\theta :=\{ W^1, W^2, ..., W^{L+1} \}\) is the learning parameter set, and \(a_l\) is an element-wise nonlinear activation function. Spectral normalization is applied on f to guarantee the Lipschitz constraint.

Proposition 1

When using Wasserstein distance as the loss metric of f, the optimal solution to f is unreachable.

4 Enforcing boundedness and continuity in CNN-based GANs

Finding a proper way to enforce the Lipschitz constraint remains an open problem. Motivated by this, we search for a better way to enforce the Lipschitz constraint.

4.1 BC Conditions

The purpose is to find the discriminator from the set of k-Lipschitz continuous functions [7], which obeys the following condition:

$$\begin{aligned} ||f({x_1}) - f({x_2})|| \le k||{x_1} - {x_2}|| \end{aligned}$$
(6)

Equation (6) is referred to as the Lipschitz continuity or Lipschitz constraint. If the discriminator function f satisfies the following conditions, it is guaranteed to meet the condition of Eq. (6):

  (a) Boundedness: f is a bounded function.

  (b) Continuity: f is a continuous function, and the number of points where f is continuous but not differentiable is finite. Besides, if f is differentiable at point x, its derivative is finite.

Conditions (a) and (b) are referred to as the boundedness and continuity (BC) conditions. A discriminator satisfying the BC conditions is referred to as a bounded discriminator, and a GAN model with the BC conditions enforced is referred to as BC-GAN. Theorems 1 and 2 below guarantee that meeting the BC conditions is sufficient to enforce the Lipschitz constraint of Eq. (6) (see the proofs in "Appendix").
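As a simple illustration of a function that meets both conditions, the sketch below takes a scaled hyperbolic tangent, which is bounded by m, continuous and differentiable everywhere, and empirically estimates the constant k of Eq. (6) on a grid. The test function, the grid and the helper name are assumptions made for this example; it is a numerical check on one function, not a substitute for the proofs.

```python
import math


def empirical_lipschitz(f, points):
    """Largest slope |f(x1) - f(x2)| / |x1 - x2| over all pairs of distinct points."""
    best = 0.0
    for i, x1 in enumerate(points):
        for x2 in points[i + 1:]:
            best = max(best, abs(f(x1) - f(x2)) / abs(x1 - x2))
    return best


m = 0.5  # bound of the toy function


def f(x):
    # Bounded by m (condition (a)) and smooth with |f'(x)| <= m (condition (b)).
    return m * math.tanh(x)


grid = [i / 10 for i in range(-50, 51)]
largest_value = max(abs(f(x)) for x in grid)
k_hat = empirical_lipschitz(f, grid)
```

On this grid the observed values stay within the bound m, and the estimated slope never exceeds m, so Eq. (6) holds for this function with k = m.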

Theorem 1

Let \(\Psi\) be the set of all \(f: X \rightarrow R\), where f is a continuous function. In addition, the number of points where f is continuous but not differentiable is finite. Besides, if f is differentiable at point x, its derivative is finite. Then, f in \(\Psi\) satisfies Lipschitz constraint.

Theorem 2

Let \(P_r\) and \(P_g\) be the distributions of real images and generated images in X, a compact metric space. Let \(\Omega\) be the set of all \(f: X \rightarrow R\), where f is a continuous and bounded function. And, the number of points where f is continuous but not differentiable is finite. Besides, if f is differentiable at point x, its derivative is finite. The set \(\Omega\) can be expressed as:

$$\begin{aligned} \Omega :\left\{f | ||f(x)|| \le m,\quad {\mathrm{if }}\frac{{\partial f(x)}}{{\partial x}}{\mathrm{exists,}}\quad ||\frac{{\partial f(x)}}{{\partial x}}|| < \infty \right\} \end{aligned}$$
(7)

where m represents the bound. Then, there must exist a k, and we have a computable \(k \cdot W(P_r, P_g)\):

$$\begin{aligned} k \cdot W({P_r},{P_g}) = \mathop {\sup }\limits _{f \in \Omega } \mathop {E}\limits _{x \sim {P_r}} [f(x)] - \mathop {E}\limits _{x \sim {P_g}} [f(x)] \end{aligned}$$
(8)

where \(W(P_r, P_g)\) represents the Wasserstein distance between \(P_r\) and \(P_g\) [5, 23].

According to Theorems 1 and 2, it is obvious that the BC conditions are sufficient to enforce the Lipschitz constraint. Furthermore, \(k \cdot W(P_r, P_g)\) is bounded and computable and can be obtained as:

$$\begin{aligned} k \cdot W({P_r},{P_g}) = \mathop {\max }\limits _{f \in \Omega } \mathop {E}\limits _{x \sim {P_r}} [f(x)] - \mathop {E}\limits _{_{z\sim {p(z)}}} [f(G(z))] \end{aligned}$$
(9)

Then, \(k \cdot W(P_r, P_g)\) can be applied as a new loss metric to guide the training of D. Logically, the new objective for D is:

$$\begin{aligned} {L_D}=\mathop {\min }\limits _{f \in \Omega } {E_{z\sim {p(z)}}}[f(G(z))] - {E_{x\sim {{P_r}}}}[f(x)] \end{aligned}$$
(10)

Theorem 3 in [2] tells us that

$$\begin{aligned} {\nabla _\theta }kW({P_r},{P_g})= - {E_{z\sim {p(z)}}}[{\nabla _\theta }f(G(z))] \end{aligned}$$
(11)

where \(\theta\) is the parameters of G. Equation (11) indicates that using gradient descent to update the parameters in G is a principled method to train the network of G. Finally, the new objective for G can be obtained:

$$\begin{aligned} {L_G}= \mathop {\min }\limits _\theta - {E_{z\sim {p(z)}}}[f(G(z))] \end{aligned}$$
(12)

4.2 Implementation of BC conditions

In this paper, we introduce a simple but efficient implementation of BC conditions. When applying the BC conditions to D, the training of D can be equivalently regarded as a conditional (constrained) optimization process. Then, Eq. (10) can be updated as:

$$\begin{aligned} &\mathop {\min }\limits _{f \in \Omega } \{ {E_{z\sim {p(z)}}}[f(G(z))] - {E_{x\sim {{P_r}}}}[f(x)]\} \\&s.t.{{ ||}}f(x)|| \le m,\quad{\mathrm{if }}\frac{{\partial f(x)}}{{\partial x}}{\mathrm{exists, }}\quad ||\frac{{\partial f(x)}}{{\partial x}}|| < \infty \end{aligned}$$
(13)

In this paper, the discriminator function f is implemented by a deep neural network, which applies a series of convolutional and nonlinear operations. Both convolutional and nonlinear functions are continuous, which means that D is a continuous function. Moreover, the gradients of the output of D with respect to the input are always finite. As a result, condition (b) is satisfied naturally. To guarantee condition (a), the Lagrange multiplier method can be applied here; then, the objective of D can be written as the following equation:

$$\begin{aligned} \begin{aligned} {L_D}=&\mathop {\min }\limits _f \{ {E_{z\sim {p(z)}}}[f(G(z))] - {E_{x\sim {{P_r}}}}[f(x)]\} \\&+ \beta \cdot {\mathrm{max}}(||[f(x)]{\mathrm{||}} - m,0) \end{aligned} \end{aligned}$$
(14)

where \(\beta\) is a hyperparameter and m represents the bound. The term \(\mathrm{max}(\left\| f(x) \right\| -m, 0)\) forces D to be a bounded function, while \(E_{z \sim p(z)}\left[ f(G(z))\right] - E_{x \sim P_r}\left[ f(x) \right]\) is used to determine \(k \cdot W(P_r, P_g)\). The procedure of training the BC-GAN is described in Algorithm 1.
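The objectives in Eqs. (12) and (14) can be sketched on batches of scalar discriminator outputs as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the bound penalty is averaged over both the real and the generated batch, which Eq. (14) does not specify. The defaults m = 0.5 and \(\beta\) = 2 are the values used in the experiments.

```python
def mean(values):
    return sum(values) / len(values)


def bc_discriminator_loss(f_real, f_fake, m=0.5, beta=2.0):
    """Eq. (14): E[f(G(z))] - E[f(x)] + beta * max(|f(x)| - m, 0).

    f_real: discriminator outputs f(x) on a batch of real samples.
    f_fake: discriminator outputs f(G(z)) on a batch of generated samples.
    """
    distance_term = mean(f_fake) - mean(f_real)
    # Assumption: the hinge on the bound is averaged over all outputs in both batches.
    outputs = list(f_real) + list(f_fake)
    bound_penalty = mean([max(abs(s) - m, 0.0) for s in outputs])
    return distance_term + beta * bound_penalty


def bc_generator_loss(f_fake):
    """Eq. (12): -E[f(G(z))]."""
    return -mean(f_fake)


f_real = [0.4, 0.6]
f_fake = [-0.5, -0.9]
d_loss = bc_discriminator_loss(f_real, f_fake)
g_loss = bc_generator_loss(f_fake)
```

Only the outputs whose magnitude exceeds m contribute to the penalty, so the term is inactive while the discriminator stays inside the bound and grows linearly once it leaves it.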

4.3 Validity

To verify the validity of the proposed BC conditions, we use synthetic datasets like those presented in [15] to test the discriminator's performance. Specifically, discriminators are trained to distinguish the fake distribution from the real one. In these toy problems, the fake distribution \(P_g\) is fixed as the real distribution \(P_r\) plus unit-variance Gaussian noise. In theory, a discriminator with good performance is more likely to learn the higher moments of the data distributions and model the real distribution. Figure 1 illustrates the value surfaces of the discriminator. The discriminator enforced by the BC conditions discriminates the real samples from the fake ones well, demonstrating the validity of the proposed method.

4.4 Comparison with spectral normalization and gradient penalty

Gradient penalty, spectral normalization and our proposed method are inspired by different motivations to enforce the Lipschitz constraint on D. Therefore, they differ in the way of implementation and in principle. The first difference is the way of implementation. Gradient penalty and our method operate on the loss function directly, while spectral normalization constrains the weight matrix instead of the loss metric.

Secondly, they differ in principle. For BC-GAN, \(k \cdot W(P_r, P_g)\) is applied to evaluate the difference between the fake and real distributions instead of \(W(P_r, P_g)\), which is used in WGAN-GP and WGAN. Moreover, WGAN-GP and SN-GAN strictly constrain the Lipschitz constant to be 1 or a known constant, while BC-GAN eases the restriction on the Lipschitz constant, and k is an unknown scalar parameter which will have no influence on the training of the network. Therefore, \(k \cdot W(P_r, P_g)\) can be employed as a new loss metric to guide the training of D.

To visualize the differences, we still use the synthetic datasets to test discriminators’ performance. Figure 1 illustrates the value surfaces of the discriminators. It is obvious that discriminators trained with gradient penalty as well as spectral normalization have pathological value surfaces even when optimization has completed, and they have failed to capture the high moments of the data distributions and instead model very simple approximations to the optimal functions. In contrast, BC-GANs have successfully learned the higher moments of the data distributions, and the discriminator can distinguish the real distribution from the fake one much better.

Fig. 1

Value surface of the discriminators trained to optimality on toy datasets. The yellow dots are data points, the lines are the value surfaces of the discriminators. Left column: spectral normalization. Middle column: gradient penalty. Right column: the proposed method. The upper, middle and lower rows are trained on 8-Gaussian, 25-Gaussian and the Swiss roll distributions, respectively. The generator is held fixed at real data plus unit-variance Gaussian noise. It is seen that discriminators trained with gradient penalty as well as spectral normalization have failed to capture the high moments of the data distribution (color figure online)

4.5 Convergence measure

One advantage of using Wasserstein distance as the metric over KL divergence is the meaningful loss. The Wasserstein distance \(W(P_r, P_g)\) shows the property of convergence [6]. If it stops decreasing, then the training of the network can be terminated. This property is useful as one does not have to stare at the generated samples to figure out the failure modes. To obtain the convergence measure in the proposed BC-GAN, a corresponding indicator of the training stage is introduced:

$$\begin{aligned} I_{GD}=\frac{1}{{||{\nabla _x}f(x)|{|_2}}} \end{aligned}$$
(15)

To prove that proposed indicator \(I_{GD}\) is capable of convergence measure, Theorem 3 is introduced.

Theorem 3

Let \(P_r\) and \(P_g\) be the distributions of real and generated images, let x be an image drawn from \(P_r\) or \(P_g\), and let f be the discriminator function, bounded by the BC conditions. Then \(I_{GD}\) in Eq. (15) is proportional to \(W(P_r, P_g)\).
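The indicator of Eq. (15) can be sketched for a single input as the reciprocal of the Euclidean norm of the input gradient. In the sketch below the gradient is approximated by central finite differences on a toy linear function; the helper names and the toy function are assumptions, and how the indicator is aggregated over a batch is not specified by Eq. (15). The gradient is assumed to be nonzero.

```python
import math


def input_gradient(f, x, eps=1e-5):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        upper = list(x)
        lower = list(x)
        upper[i] += eps
        lower[i] -= eps
        grad.append((f(upper) - f(lower)) / (2 * eps))
    return grad


def training_indicator(f, x):
    """Eq. (15): I_GD = 1 / ||grad_x f(x)||_2 (the gradient must be nonzero)."""
    grad = input_gradient(f, x)
    return 1.0 / math.sqrt(sum(g * g for g in grad))


def toy_discriminator(x):
    # Linear toy function whose input gradient is (3, 4), with Euclidean norm 5.
    return 3.0 * x[0] + 4.0 * x[1]


indicator = training_indicator(toy_discriminator, [0.2, -0.1])
```

In a deep learning framework the finite-difference step would normally be replaced by automatic differentiation of f with respect to its input.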

5 Experiments

5.1 Experimental setup

To assess the performance of BC-GAN, image generation experiments are conducted on the CIFAR-10 [20], STL-10 [8] and CELEBA [25] datasets. Two widely used GAN architectures, the standard CNN and the ResNet-based CNN [6], are applied for the image generation task. For the architecture details, please see "Appendix". Equations (14) and (12) are used as the loss metrics of D and G, respectively. \(I_{GD}\) in Eq. (15) serves as the convergence measure. m and \(\beta\) in Eq. (14) are set to 0.5 and 2, respectively. For optimization, Adam [9] is used in all the experiments with \(\alpha\) = 0.0002, \(\beta _1=0\), \(\beta _2=0.9\). D is updated 5 times per G update. To remain consistent with previous GANs, we set the batch size to 64. Inception score [17] and Fréchet inception distance [8] are used for quantitative assessment of generated examples.
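For reference, the settings listed above can be collected in a small configuration fragment. The key names are hypothetical, and reading \(\alpha\) as the Adam step size is an assumption; the values themselves are the ones stated in the text.

```python
# Hyperparameters stated in Sect. 5.1; the key names are illustrative only.
BC_GAN_CONFIG = {
    "bound_m": 0.5,            # m in Eq. (14)
    "penalty_beta": 2.0,       # beta in Eq. (14)
    "adam_step_size": 0.0002,  # alpha, assumed to be the Adam learning rate
    "adam_beta1": 0.0,
    "adam_beta2": 0.9,
    "d_updates_per_g_update": 5,
    "batch_size": 64,
}


def discriminator_updates(generator_updates, config=BC_GAN_CONFIG):
    """Number of D updates performed for a given number of G updates."""
    return generator_updates * config["d_updates_per_g_update"]


d_updates_for_100_g_updates = discriminator_updates(100)
```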

Although inception score and Fréchet inception distance are widely used as evaluation metrics for GANs, Barratt [3] suggests that generative models should be evaluated and compared more systematically and carefully, because the inception score may not correlate strictly with image quality. Recently, Catherine [14] proposed a new method to evaluate generative models, called skill rating. Skill rating evaluates models by carrying out tournaments between the discriminators and generators. For better evaluation, results assessed by skill rating are also presented.

5.2 Results on image generation

Image generation tasks are carried out on the CIFAR-10 and STL-10 datasets. Based on the ResNet-based CNN architecture, we obtain average inception scores of 8.40 and 9.15 for image generation on CIFAR-10 and STL-10, respectively. We compare our algorithm against multiple benchmark methods. In Table 1, we show the inception score and Fréchet inception distance of different methods with their corresponding optimal settings on the CIFAR-10 and STL-10 datasets. As illustrated in Table 1, BC-GAN has performance comparable with the state-of-the-art GANs. We also conduct image generation on the CELEBA [25] dataset. Examples of generated images are shown in Figs. 2 and 3.

Table 1 IS and FID of unsupervised image generation on CIFAR-10 and STL-10. IS is the inception score, and FID is the Fréchet inception distance. For IS, higher is better, while for FID, lower is better

Fig. 2

Image generation on CIFAR-10 dataset using a SN-GAN, b WGAN-GP and c BC-GAN

Fig. 3

Image generation on CELEBA dataset using a SN-GAN, b WGAN-GP and c BC-GAN

Fig. 4

Matches between D and G. Wasserstein distance is utilized to indicate the results instead of the win rate. With larger value of the Wasserstein distance, D is more likely to distinguish the real images from the fake ones. Lower value of the Wasserstein distance indicates that G is more likely to fool D

Skill rating [14] was recently introduced to judge a GAN model by matches between G and D. To determine the outcome of a match between G and D, D judges two batches: one batch of samples from G and one batch of real data. Every sample x that is not judged correctly by D (e.g., D(x) > 0.5 for the generated data or D(x) < 0.5 for the real data) counts as a win for G and is used to compute its win rate. The win rate tests the performance of D against G dynamically during training and reveals whether D or G dominates while the other stops improving. If D dominates and G stops updating, the win rate for G decreases dramatically. We make some modifications, because we use Wasserstein distance rather than a probability to determine the difference between fake and real data. As a result, we show the loss of D instead of the win rate in Fig. 4. When D from a later iteration is used to distinguish images generated at an earlier iteration from real images, it outputs a large loss, meaning that D can easily distinguish the generated (fake) images from real images. Conversely, images generated at a later iteration can easily fool D from an earlier iteration. Therefore, training is healthy, and the performance of D and G improves continuously during training.
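The original win-rate rule described above, before the modification to a Wasserstein-based loss, can be sketched as follows. The function name and the sample outputs are hypothetical; D(x) is assumed to be a probability that x is real, with 0.5 as the decision threshold.

```python
def generator_win_rate(d_real, d_fake, threshold=0.5):
    """Fraction of samples that D judges incorrectly, counted as wins for G.

    d_real: outputs D(x) on a batch of real samples.
    d_fake: outputs D(x) on a batch of generated samples.
    """
    fooled_on_fake = sum(1 for d in d_fake if d > threshold)   # fake judged as real
    missed_on_real = sum(1 for d in d_real if d < threshold)   # real judged as fake
    return (fooled_on_fake + missed_on_real) / (len(d_real) + len(d_fake))


d_real = [0.9, 0.4, 0.8]
d_fake = [0.2, 0.7, 0.1]
win_rate = generator_win_rate(d_real, d_fake)
```

In this example D misjudges one real and one generated sample out of six, so G wins a third of the match; a win rate near 0 would indicate that D dominates.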

When applying KL divergence as the loss metric of D, the training of GANs suffers from the vanishing gradient problem, i.e., zero gradient would backpropagate to G, and the training would completely stop. As a comparison, Fig. 4 shows a healthy training during the entire iterations, further indicating the effectiveness of BC-GANs.

6 Analysis

6.1 Bound m

The parameter m in Eq. (14) represents the bound of D, and it controls the gradient \(\partial L_{D}\)/\(\partial\)x, where \(L_D\) is the loss of D, x is the image and \(\partial L_{D}\)/\(\partial\)x is the gradient backpropagated from D to G. This gradient affects the training of G and thereby influences the model performance. The explanation is as follows. The discriminator f is a bounded function. Given enough iterations, \({f_{x\sim Pr}}(\textit{x})\) would always converge to m and \({f_{x \sim Pg}}(\textit{x})\) would converge to \(-m\). Considering that f satisfies the k-Lipschitz constraint, the following condition is satisfied:

$$\begin{aligned}&||{f_{{x_r}\sim {{P_r}}}}({x_r}) - {f_{_{{x_g}\sim {{P_g}}}}}({x_g})|| \approx 2m \le k||{x_r} - {x_g}|| \end{aligned}$$
(16)
$$\begin{aligned}&\frac{{2m}}{{||{x_r} - {x_g}||}} \le k \end{aligned}$$
(17)

k determines the upper bound of the gradient backpropagated from D to G, and by Eq. (17) its lower bound is directly proportional to m. Increasing m therefore raises the upper bound of the gradients \(\partial L_{D}/\partial\)x. This is verified by the experiment shown in Fig. 5a. Moreover, the gradients are used to guide the training of the generator and naturally affect the performance of the model. Increasing m from 0.5 to 2 leads to decreased performance (inception score drops from 8.40 to 7.56). Therefore, properly controlling the gradient is important for improving the performance of GAN models, and the bound m provides such a mechanism. We recommend m = 0.5 for the image generation task on CIFAR-10. One possible explanation why a smaller m (hence smaller backpropagated gradients) leads to better performance is that the error surfaces are highly nonlinear and backpropagation is a greedy gradient descent algorithm, so small gradients may help the optimization reach a deeper local minimum or even the global minimum of the error surface.
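The bound in Eq. (17) is simple arithmetic and can be checked on a toy pair of points. In the sketch below, the function name and the two-dimensional points are assumptions chosen so that the distance between them is exactly 5.

```python
import math


def lipschitz_lower_bound(m, x_r, x_g):
    """Eq. (17): k >= 2m / ||x_r - x_g|| once f(x_r) -> m and f(x_g) -> -m."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_r, x_g)))
    return 2.0 * m / distance


x_r = [3.0, 4.0]  # toy "real" point
x_g = [0.0, 0.0]  # toy "generated" point, at Euclidean distance 5 from x_r

k_small_bound = lipschitz_lower_bound(0.5, x_r, x_g)
k_large_bound = lipschitz_lower_bound(2.0, x_r, x_g)
```

Raising m from 0.5 to 2 multiplies the smallest admissible k by four for the same pair of points, which is consistent with the larger gradients observed for larger m.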

Fig. 5

a Variation of the gradient \(\partial {L_{D}}/\partial\)x with iterations in BC-GAN. Larger m leads to higher gradients. b Variation of the gradient with iterations in WGAN-GP. c Comparison of the gradient variation of SN-GAN and BC-GAN, where SN represents SN-GAN, and BC is BC-GAN

We also monitor the variation of the gradient for WGAN-GP and SN-GAN. The behavior of the gradient differs between models. The gradient penalty term in WGAN-GP forces the gradient of the output of D with respect to the input to be a fixed number. Therefore, as shown in Fig. 5b, the gradient stays around 1 throughout the training process. For SN-GAN and our BC-GAN in Fig. 5c, the variation of the gradient is similar: as training proceeds, the gradient tends to increase until convergence is reached. The difference is that the amplitude of the gradient in SN-GAN is larger than that in BC-GAN. As mentioned above, the amplitude of the gradient affects the training of the generator. However, SN-GAN provides no mechanism for controlling the gradient, while the bound m in BC-GAN serves to control it. Thus, at least from this perspective, BC-GAN offers better control over performance than SN-GAN.

6.2 Meaningful training stage indicator \(I_{GD}\)

We introduce a new indicator \(I_{GD}\) for monitoring the training stage. Figure 6a shows the correlation of \(-I_{GD}\) with inception score during the training process. Because \(I_{GD}\) decreases with the iteration, we use \(-I_{GD}\) instead. As we can see, \(-I_{GD}\) has a positive correlation with the inception score. As it is easier to visualize the correlation between \(I_{GD}\) and image quality in higher-resolution images, we perform the image generation task on the CELEBA [25] dataset and show the variation of \(I_{GD}\) with iterations in Fig. 6b. It is clearly seen that \(I_{GD}\) correlates well with image quality during the training process.

Fig. 6

a Correlation of −\({I_{GD}}\) with inception score on CIFAR10. b Variation of \({I_{GD}}\) with iteration for the training on CELEBA database. \({I_{GD}}\) correlates well with the image quality, indicating that \(I_{GD}\) can be regarded as the indicator of the training stage

Fig. 7

Computation time for 100 generator updates. GP for WGAN-GP and SN for SN-GAN. We use standard CNN as the architecture. Tests are based on Nvidia 1080Ti

6.3 Training time

It is worth noting that BC-GAN is computationally efficient. We list the computational time for 100 generator updates in Fig. 7. WGAN-GP requires more computational time because it needs to calculate the gradient of the gradient norm \(\Vert \triangledown _{\textit{x}}D\Vert _{2}\), which needs one whole round of forward and backward propagation. Spectral normalization needs to calculate the largest singular value of the matrices in each layer. Moreover, for gradient penalty and spectral normalization, the extra computational cost grows with the number of layers. As for BC-GAN, there is no matrix operation or gradient calculation in the backpropagation. As a result, it has lower computational cost.

7 Concluding remarks

In this paper, we have introduced a new generative adversarial network training technique called BC-GAN, which uses a bounded discriminator to enforce the Lipschitz constraint. In addition to providing theoretical background, we have also presented practical implementation procedures for training BC-GAN. Experiments on synthetic as well as real data show that the new BC-GAN performs better and has lower computational complexity than recent techniques such as spectral normalization GAN (SN-GAN) and Wasserstein GAN with gradient penalty (WGAN-GP). We have also introduced a new training convergence measure which correlates directly with the image quality of the generator output and can be conveniently used to monitor training progress and to decide when training is completed.