Abstract

Text-to-image synthesis is an important and challenging application of computer vision. Many interesting and meaningful text-to-image synthesis models have been put forward. However, most of these works focus on the quality of the synthesized images and rarely consider model size. Large models have many parameters and high latency, which makes them difficult to deploy in mobile applications. To solve this problem, we propose CPGAN, an efficient architecture for text-to-image generative adversarial networks (GAN) based on canonical polyadic decomposition (CPD). It is a general method for designing lightweight text-to-image GAN architectures. To improve the stability of CPGAN, we introduce conditioning augmentation and the idea of the autoencoder during training. Experimental results show that CPGAN maintains the quality of generated images while reducing parameters and flops by at least 20%.

1. Introduction

Text-to-image synthesis is a challenging cross-modal generation task that produces images from given texts. It extracts the semantic information shared between the two modalities from the text and transfers it into images. Text-to-image synthesis plays an increasingly important role in computer vision. In the past, images could only be edited using other images. With the development of text-to-image synthesis, images can also be edited with text, which greatly expands the applications of computer vision. Text-to-image synthesis can be widely applied in human-computer interaction, such as cross-modal retrieval [1] and artistic creation [2, 3].

Traditional text-to-image synthesis used variational autoencoders (VAE), attention mechanisms, and recurrent neural networks (RNN) to generate images step by step [4, 5]. Limited by the generative ability of VAE, the generated images are not as clear as real images. A new generative model, GAN, was proposed by Goodfellow et al. in 2014 [6] and has become popular in image generation tasks because of its strong generative ability. Reed et al. [7] showed that GAN can generate clear images from text descriptions and proposed GAN-int-cls. It uses DCGAN as the backbone, with the text embedding and random noise as inputs of the generator; the generated images, text embedding, and real images are inputs of the discriminator. Subsequently, many sophisticated models were proposed. These models can generate images from general text, scene graphs, or dialog, and the quality of generated images has improved considerably.

However, these models introduce many constraints and modules to generate realistic images, which greatly increases the parameters and floating-point operations (flops) of the models. Deploying them requires more and more hardware resources (CPU, GPU, memory, and bandwidth). High complexity also leads to high latency. This greatly limits the application of text-to-image GAN on mobile terminals, so it is necessary to compress text-to-image GAN. Among tensor decomposition methods, canonical polyadic decomposition (CPD) is a simple and efficient way to compress and accelerate models. Many implementations of convolutional neural network (CNN) compression based on CPD [7–9] have already been proposed.

In this paper, we propose CPGAN, a general compressed architecture for designing text-to-image GAN with fewer parameters and flops. CPGAN redesigns each layer of the original neural network by using CPD. The original convolution layer is decomposed into three small convolution layers whose size is determined by the rank. A layer with a smaller rank has fewer parameters. According to the needs of the application, we can design architectures with different compression ratios by setting different ranks. Selecting an appropriate learning rate for models with different ranks is time-consuming, so we use the cyclical learning rate (CLR) [11] method to select the learning rate for the redesigned architecture. In addition, GAN training is known to be unstable, and CPGAN is deeper than the classical GAN and difficult to train from scratch. To solve this problem, we add a conditioning augmentation module and introduce the idea of the autoencoder.

Our contributions can be summarized as follows:
(i) We propose CPGAN to reduce parameters while maintaining the generative ability of text-to-image GAN. It is a general method for designing lightweight GAN architectures.
(ii) To avoid the high resource consumption caused by the decomposition operation, we train CPGAN from scratch and do not need a pretrained model. To the best of our knowledge, this is the first time CPD has been used to design a text-to-image GAN without a pretrained model.
(iii) To stabilize the end-to-end training, we introduce the idea of the autoencoder. The added decoder modules can be removed after training.

Experimental results on two representative cross-modal datasets (Oxford-102 and CUB) show that CPGAN maintains the quality of generated images while effectively reducing the parameters and flops of the original model. On both Oxford-102 and CUB, CPGAN performs better than the original model in inception score (IS) and Fréchet inception distance (FID). On Oxford-102, it reduces parameters by about 23% and flops by about 29%. These results show that our architecture can efficiently redesign text-to-image GAN without loss of image quality.

The rest of the paper is organized as follows. The work related to our paper is introduced in Section 2. In Section 3, we propose the efficient architecture CPGAN of text-to-image generative adversarial networks (GAN) based on canonical polyadic decomposition (CPD). Section 4 describes experimental settings and experimental results. Finally, we conclude this paper in Section 5.

2. Related Work

2.1. Canonical Polyadic Decomposition

In essence, a neural network transforms its input data through weight parameters. The weights of each layer form a large tensor, which can be decomposed into several small tensors. Canonical polyadic decomposition (CPD) is a standard tensor decomposition method. It was proposed by Hitchcock in 1927 [12] and decomposes a tensor into a sum of rank-one tensors. CPD has been applied in psychometrics [13], signal processing [14], computer vision [15], data mining [16], and elsewhere. It also performs well in model compression.

Denton et al. [8] used CPD to approximate the original convolution kernel and presented two methods for improving the approximation criterion. They fine-tuned the decomposed kernels while keeping the other layers fixed. Jaderberg et al. [9] applied CPD to decompose a 4D kernel into two small kernels and used two methods to reconstruct the original filters. Lebedev et al. [10] used CPD with nonlinear least squares to decompose the 4D convolution kernel tensor into four small kernels and replaced the original layer with them. Then, they fine-tuned the entire network using backpropagation. The method of Lebedev et al. [10] accelerated the second convolutional layer of AlexNet by 6.6 times at the cost of 1% accuracy loss. This exceeded the other two works, where Denton et al. [8] obtained a 2 times speed-up and Jaderberg et al. [9] obtained a 4.5 times speed-up at the cost of 1% accuracy loss.

Astrid et al. [17] proposed CP-TPM, a CNN compression method based on CPD. It achieved a 6.98 times parameter reduction and a 3.53 times speed-up on AlexNet, which is better than the Tucker-based method [18] on the same network. Zhang et al. [19] and Tai et al. [20] also applied CPD to compress CNN. In the models of Astrid et al. [17], Zhang et al. [19], and Tai et al. [20], the original layers are pretrained, and the decomposition minimizes the difference between the decomposed layer and the original tensor. Because the CP decomposition operation consumes extensive resources, we do not decompose a pretrained weight tensor but directly use CP decomposition to design an efficient architecture for text-to-image GAN.

2.2. Text-to-Image Synthesis

Text-to-image synthesis is a branch of computer vision that generates images according to given texts. It can be used for image editing, cross-modal retrieval, and artistic creation. GAN has strong generative ability. It can generate realistic images and has been widely used for image generation. Since Reed et al. [7] first successfully used GAN for text-to-image generation, GAN has also become a popular model in text-to-image synthesis.

Reed et al. [7] proposed GAN-int-cls by revising DCGAN and successfully generated plausible 64 × 64 images of birds and flowers from texts. To produce high resolution images, multistage generation was introduced into text-to-image synthesis, as in StackGAN [21], StackGAN++ [22], HDGAN [23], and LAPGAN [24]. StackGAN [21] stacked two conditional GANs to generate high resolution and plain images in two stages. StackGAN++ [22] used multiple generators arranged in a tree structure to generate images of different scales. HDGAN [23] adopted hierarchically-nested discriminators to help the single-stream generator generate high resolution images. LAPGAN [24] put forward a Laplacian pyramid framework by integrating a set of generators.

Xu et al. [25] and Qiao et al. [26] added attention mechanisms to synthesize images with fine-grained details. In addition, Reed et al. [27] used bounding box and key part information to improve the quality of generated images. ACGAN [28] and TAC-GAN [29] used auxiliary class information to generate diverse images. Because these models show excellent cross-modal generative ability, text-to-image GAN has been used for image editing [30, 31], cross-modal retrieval [1], story visualization [2], and painting [3]. However, these models are too complicated to be deployed on mobile devices. To this end, we propose an end-to-end compression framework based on CPD. Unlike Shu et al. [32] and Li et al. [33], we do not need to pretrain the GAN model. We design and train the compressed model from scratch.

3. Canonical Polyadic Generative Adversarial Networks (CPGAN)

In this section, we introduce the design of the efficient architecture (CPGAN) and its training process. Section 3.1 describes how to replace 4-dimensional convolutional weight tensors with three small kernels. Section 3.2 describes techniques for stabilizing the training process of the redesigned architecture.

3.1. Canonical Polyadic Decomposition

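GAN generally consists of a generator and a discriminator, both of which are convolutional neural networks. The weight tensor of a convolution is a 4-dimensional tensor $K \in \mathbb{R}^{d \times d \times S \times T}$, which maps the input $U \in \mathbb{R}^{X \times Y \times S}$ into another representation $V \in \mathbb{R}^{(X-d+1) \times (Y-d+1) \times T}$. It can be written as

$$V(x, y, t) = \sum_{i=x-\delta}^{x+\delta} \sum_{j=y-\delta}^{y+\delta} \sum_{s=1}^{S} K(i-x+\delta,\, j-y+\delta,\, s,\, t)\, U(i, j, s), \tag{1}$$

where the first two dimensions of $K$ are the spatial dimensions ($d$ is typically 3 or 5, and $\delta = (d-1)/2$), the third dimension is the input channel, and the fourth dimension is the output channel.

CPD is an approximation method that decomposes a tensor into a sum of rank-one tensors. In CPD, the tensor $K$ can be represented as

$$K(i, j, s, t) = \sum_{r=1}^{R} K^{x}(i, r)\, K^{y}(j, r)\, K^{s}(s, r)\, K^{t}(t, r), \tag{2}$$

where $R$ is the tensor rank, that is, the number of rank-one tensors in the sum, and $K^{x}$, $K^{y}$, $K^{s}$, and $K^{t}$ are tensors of size $d \times R$, $d \times R$, $S \times R$, and $T \times R$, respectively. A rank-one tensor is the outer product of vectors. Rank selection determines the compression ratio, and finding the optimal rank is an NP-hard problem.

In a convolutional layer, the spatial dimensions do not have to be decomposed because the benefit of spatial decomposition is quite small. Using this variant of CP decomposition, the tensor $K$ can be decomposed as

$$K(i, j, s, t) = \sum_{r=1}^{R} K^{xy}(i, j, r)\, K^{s}(s, r)\, K^{t}(t, r), \tag{3}$$

where $K^{xy}$ is a tensor of size $d \times d \times R$. Substituting equation (3) into equation (1), we obtain the following approximate representation of the convolution:

$$V(x, y, t) = \sum_{r=1}^{R} K^{t}(t, r) \left( \sum_{i=x-\delta}^{x+\delta} \sum_{j=y-\delta}^{y+\delta} K^{xy}(i-x+\delta,\, j-y+\delta,\, r) \left( \sum_{s=1}^{S} K^{s}(s, r)\, U(i, j, s) \right) \right). \tag{4}$$

Rearranging and grouping the terms, we get the following three consecutive expressions:

$$U^{s}(i, j, r) = \sum_{s=1}^{S} K^{s}(s, r)\, U(i, j, s), \tag{5}$$

$$U^{xy}(x, y, r) = \sum_{i=x-\delta}^{x+\delta} \sum_{j=y-\delta}^{y+\delta} K^{xy}(i-x+\delta,\, j-y+\delta,\, r)\, U^{s}(i, j, r), \tag{6}$$

$$V(x, y, t) = \sum_{r=1}^{R} K^{t}(t, r)\, U^{xy}(x, y, r), \tag{7}$$

where $U^{s}$ and $U^{xy}$ are the intermediate tensors of sizes $X \times Y \times R$ and $(X-d+1) \times (Y-d+1) \times R$, respectively. The original big layer can be decomposed into three small layers, as shown in Figure 1. For example, the third convolution layer of GAN-int-cls has 128 input channels, 512 output channels, and 3 × 3 filters ($S = 128$, $T = 512$, $d = 3$); we can decompose it into three convolution layers with the following parameters: ($S = 128$, $T = R$, $d = 1$), ($S = R$, $T = R$, $d = 3$), and ($S = R$, $T = 512$, $d = 1$). $R$ is the rank, which can be set to different values according to the needs of the task.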
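The equivalence between one large convolution and three small ones can be checked numerically. The following NumPy sketch builds a full kernel from a d × d × R spatial factor, an S × R input-channel factor, and a T × R output-channel factor, and compares the direct convolution with the three-step computation. It is only an illustration under stated assumptions, not the implementation used in the experiments: the function names, the (height, width, channel) array layout, the zero-based "valid" indexing, and the small random shapes are all hypothetical choices.

```python
import numpy as np

def cp_kernel(k_xy, k_s, k_t):
    """Rebuild the full d x d x S x T kernel from the three CP factors."""
    # K(i, j, s, t) = sum_r K_xy(i, j, r) * K_s(s, r) * K_t(t, r)
    return np.einsum("ijr,sr,tr->ijst", k_xy, k_s, k_t)

def conv_full(u, k):
    """Direct 'valid' convolution of u (X x Y x S) with k (d x d x S x T)."""
    d = k.shape[0]
    out_x, out_y = u.shape[0] - d + 1, u.shape[1] - d + 1
    v = np.zeros((out_x, out_y, k.shape[3]))
    for x in range(out_x):
        for y in range(out_y):
            patch = u[x:x + d, y:y + d, :]                # d x d x S window
            v[x, y] = np.einsum("ijs,ijst->t", patch, k)
    return v

def conv_three_steps(u, k_xy, k_s, k_t):
    """The same mapping computed as three small layers."""
    d, _, rank = k_xy.shape
    u_s = np.einsum("xys,sr->xyr", u, k_s)                # 1 x 1 layer: S -> R channels
    out_x, out_y = u.shape[0] - d + 1, u.shape[1] - d + 1
    u_xy = np.zeros((out_x, out_y, rank))
    for x in range(out_x):
        for y in range(out_y):
            patch = u_s[x:x + d, y:y + d, :]              # d x d x R window
            u_xy[x, y] = np.einsum("ijr,ijr->r", patch, k_xy)  # d x d filter per rank channel
    return np.einsum("xyr,tr->xyt", u_xy, k_t)            # 1 x 1 layer: R -> T channels

# Small hypothetical example: X = Y = 8, S = 4, T = 5, d = 3, R = 2.
rng = np.random.default_rng(0)
u = rng.normal(size=(8, 8, 4))
k_xy = rng.normal(size=(3, 3, 2))
k_s = rng.normal(size=(4, 2))
k_t = rng.normal(size=(5, 2))
direct = conv_full(u, cp_kernel(k_xy, k_s, k_t))
factored = conv_three_steps(u, k_xy, k_s, k_t)
print(np.allclose(direct, factored))
```

In this sketch the two results agree only because the full kernel is built exactly from the factors. In CPGAN the factors are trained directly, so no full kernel is ever formed or approximated.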

3.2. Overall Framework

We take the classical model GAN-int-cls as the original model to compress. This model has the most compact structure and parameters. The main convolution layers of the generator in other text-to-image GAN models are similar to GAN-int-cls. We redesign GAN-int-cls to show the effectiveness and generality of our compression architecture. As shown in Figure 2, the proposed CPGAN contains two novel components that stabilize the training of the decomposed GAN: conditioning augmentation and the autoencoder module.

Conditioning augmentation (CA), proposed by Zhang et al. [21], alleviates the difficulty of GAN training caused by the sparsity of text embeddings. CA randomly samples the latent variables used as the input of the generator from the independent Gaussian distribution $\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t))$. $\varphi_t$ is the text embedding generated by encoding the text description. $\mu(\varphi_t)$ and $\Sigma(\varphi_t)$ are the mean and diagonal covariance matrix functions of the text embedding $\varphi_t$, respectively. We use the pretrained char–CNN–RNN [34] to get the text embedding $\varphi_t$. Then, we feed $\varphi_t$ into CA and obtain $\mu(\varphi_t)$ and $\Sigma(\varphi_t)$. Similar to StackGAN [21], we also add the Kullback–Leibler (KL) divergence to our training objectives, which is the KL divergence between the standard Gaussian distribution $\mathcal{N}(0, I)$ and the conditioning Gaussian distribution $\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t))$, as shown in the following equation:

$$D_{KL}\left(\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t)) \,\|\, \mathcal{N}(0, I)\right). \tag{8}$$
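The sampling step and the KL term of conditioning augmentation can be sketched in a few lines. The example below assumes that the mean and the log-variance of the diagonal Gaussian have already been computed from the text embedding (in practice by a learned layer, which is omitted here); the log-variance parameterization and the function names are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def conditioning_augmentation(mu, log_var, rng):
    """Sample c = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(np.shape(mu))
    return np.asarray(mu) + np.exp(0.5 * np.asarray(log_var)) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the last axis."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

# Hypothetical 4-dimensional conditioning vector.
rng = np.random.default_rng(0)
mu = np.array([0.5, -0.2, 0.0, 1.0])
log_var = np.array([0.0, -1.0, 0.3, 0.0])
c_hat = conditioning_augmentation(mu, log_var, rng)
print(c_hat.shape, kl_to_standard_normal(mu, log_var))
```

The KL term is zero exactly when the conditioning distribution equals the standard Gaussian, so adding it to the objective keeps the sampled conditioning variables close to a smooth, dense manifold.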

Autoencoder (AE) is used for representation learning by reconstructing its input. The decomposed architecture is deeper than the original model, which increases the instability of training, so we use AE to stabilize the training process. AE is generally composed of an encoder and a decoder. We regard each convolution layer as an encoder and add a decoder corresponding to each convolution layer. The training objective of AE is the reconstruction loss. We use mean square error (MSE) as the AE loss, $L_{AE} = \mathrm{MSE}(x, f(x))$, where $x$ is the input of the layer and $f$ is the function of AE. The decoder will be removed after training.

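The generator objective of the original GAN-int-cls contains the matching-aware loss and the interpolation loss, as shown in

$$L_G = \mathbb{E}_{z, t_1}\left[\log\left(1 - D(G(z, t_1), t_1)\right)\right] + \mathbb{E}_{z, t_1, t_2}\left[\log\left(1 - D\left(G(z, \beta t_1 + (1-\beta) t_2),\, \beta t_1 + (1-\beta) t_2\right)\right)\right], \tag{9}$$

where $z$ is the random noise, $t_1$ and $t_2$ are text embeddings, and $\beta$ is a decimal between 0 and 1 used to interpolate between the text embeddings $t_1$ and $t_2$.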

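In the generator objective of our model, we add the KL divergence and the MSE reconstruction loss to the original model objective, as shown in the following equation:

$$L_G^{CPGAN} = L_G + D_{KL}\left(\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t)) \,\|\, \mathcal{N}(0, I)\right) + L_{AE}, \tag{10}$$

where $L_G$ is the generator objective of the original model, the second term is the KL divergence of conditioning augmentation, and $L_{AE}$ is the MSE reconstruction loss of the autoencoder.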

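The discriminator objective of both the original model and our model is the matching-aware loss:

$$L_D = \mathbb{E}_{x, t}\left[\log D(x, t)\right] + \frac{1}{2}\left(\mathbb{E}_{x, \hat{t}}\left[\log\left(1 - D(x, \hat{t})\right)\right] + \mathbb{E}_{z, t}\left[\log\left(1 - D(G(z, t), t)\right)\right]\right), \tag{11}$$

where $x$ is a real image, $t$ is the matching text embedding, and $\hat{t}$ is a mismatching text embedding.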
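The two ingredients inherited from GAN-int-cls, text embedding interpolation and the matching-aware discriminator objective, can be illustrated numerically. The sketch below follows the formulation of Reed et al. [7] as we understand it: the discriminator scores a real image with matching text, a real image with mismatching text, and a generated image with matching text, and maximizes log(s_real) + (log(1 − s_wrong) + log(1 − s_fake)) / 2. The function names, the small epsilon for numerical safety, and the example scores are hypothetical.

```python
import numpy as np

def interpolate_embeddings(t1, t2, beta):
    """Convex combination beta * t1 + (1 - beta) * t2 of two text embeddings."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * np.asarray(t1, dtype=float) + (1.0 - beta) * np.asarray(t2, dtype=float)

def matching_aware_d_objective(s_real, s_wrong, s_fake, eps=1e-12):
    """Batch mean of log(s_real) + (log(1 - s_wrong) + log(1 - s_fake)) / 2.

    s_real:  D(real image, matching text)
    s_wrong: D(real image, mismatching text)
    s_fake:  D(generated image, matching text)
    All scores are probabilities in [0, 1]; the discriminator maximizes this value.
    """
    s_real, s_wrong, s_fake = (np.asarray(s, dtype=float) for s in (s_real, s_wrong, s_fake))
    per_sample = np.log(s_real + eps) + 0.5 * (
        np.log(1.0 - s_wrong + eps) + np.log(1.0 - s_fake + eps)
    )
    return float(per_sample.mean())

# Hypothetical embeddings and discriminator scores for a batch of two samples.
t_mix = interpolate_embeddings([2.0, 4.0], [0.0, 0.0], beta=0.25)
objective = matching_aware_d_objective([0.9, 0.8], [0.2, 0.3], [0.1, 0.4])
print(t_mix, objective)
```

The objective reaches its maximum of zero only when the discriminator accepts every matching real pair and rejects both mismatching and generated pairs, which is what forces it to judge text-image consistency and not only image realism.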

We use the above scheme to train an efficient architecture from scratch. The training algorithm is shown in Algorithm 1. First, the original convolutions are decomposed into three layers through equations (5)–(7). Second, each layer is regarded as an encoder, and a corresponding decoder is added for each layer. Third, we encode the matching text and the mismatching text to get text embeddings. Then, we use CA to process the text embeddings and obtain an independent Gaussian distribution. From this distribution, we sample variables and concatenate them with random noise. The rest of the training process is the same as in GAN-int-cls except for the generator objective, which adds the CA loss and the autoencoder loss to the objective function of the original model. After training is finished, we remove the added decoder layers and obtain the CPGAN model.

Input: mini-batch images $x$, text description $t$, and number of training batch steps $N$.
Output: CPGAN model.
(1) Use equations (5)–(7) to decompose the original convolutional layers in the generator;
(2) Add the CA module for text embedding and add decoder layers;
(3) Select an appropriate learning rate for the decomposed model;
(4) For $n = 1$ to $N$ do
(5)   Encode the text description $t$ into the embedding $\varphi_t$;
(6)   Feed $\varphi_t$ into CA and obtain $\mu(\varphi_t)$ and $\Sigma(\varphi_t)$;
(7)   Sample $\hat{c}$ from $\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t))$ and random noise $z$;
(8)   Concatenate $\hat{c}$ and $z$ and feed them into the generator;
(9)   Update discriminator D by equation (11);
(10)   Update generator G by equation (10);
(11) End for
(12) Discard all decoders and get a trained CPGAN.

4. Experiments

We conduct extensive experiments to evaluate the proposed CPGAN. In Section 4.1, we introduce the experimental datasets and evaluation metrics. Section 4.2 describes the setting of the learning rate and the other experimental hyperparameters. In Section 4.3, we compare our CPGAN with the original GAN-int-cls model for text-to-image synthesis.

4.1. Datasets and Evaluation Metrics

To show the generality of our method, we choose the classic model GAN-int-cls as our original model. As with GAN-int-cls, our method is evaluated on CUB [35] and Oxford-102 [36]. The CUB dataset covers 200 kinds of birds, including 5,994 training images and 5,794 test images. In addition to category labels, each image is annotated with a bounding box, bird key part information, and bird attributes. Oxford-102 is a flower dataset that contains 8,189 images. It is divided into 102 categories, and each category contains 40 to 258 images. The images have large scale, pose, and light variations. The dataset is divided into a training set, a validation set, and a test set. Both datasets are benchmark image datasets, and each image corresponds to 10 single-sentence descriptions.

We use inception score (IS) and Fréchet inception distance (FID) to evaluate the quality of the generated images. IS uses a pretrained InceptionNet-V3 to judge whether the generated images are clear and diverse; a high IS means that the images are clear and diverse. FID calculates the feature distance between real images and generated images and serves as a supplement to IS; a lower FID means that the generated images are closer to the real ones. These two indicators are widely used to evaluate the quality of generated images.
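Both metrics have simple closed forms once the Inception outputs are available. The sketch below uses the standard definitions, IS = exp(E_x[KL(p(y|x) || p(y))]) and FID = ||μ_1 − μ_2||² + Tr(Σ_1 + Σ_2 − 2(Σ_1 Σ_2)^{1/2}). It assumes that the class probabilities and the feature means and covariances have already been extracted with a pretrained Inception network, which is not included here; the function names and the toy inputs are hypothetical.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp(mean_x KL(p(y|x) || p(y))) for an N x C matrix of class probabilities."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)          # p(y), the marginal class distribution
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between the Gaussians N(mu1, cov1) and N(mu2, cov2)."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    cov1, cov2 = np.asarray(cov1, dtype=float), np.asarray(cov2, dtype=float)
    diff = mu1 - mu2
    # Tr((cov1 cov2)^(1/2)) equals the sum of square roots of the eigenvalues of cov1 cov2,
    # which are real and non-negative for positive semidefinite covariances.
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

# Toy inputs: four confident predictions spread over four classes, and two 2-D Gaussians.
print(inception_score(np.eye(4)))
print(frechet_distance([1.0, 0.0], np.eye(2), [0.0, 0.0], np.eye(2)))
```

Because IS depends on the estimated marginal p(y), it needs enough generated samples to be reliable, whereas FID compares generated images directly against statistics of real images.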

4.2. Implementation Details

The learning rate is a very important hyperparameter in deep learning. A reasonable learning rate helps the model converge to a good minimum instead of a poor local optimum or a saddle point. In this paper, we use the CLR method [11] to set the learning rate and MultiStepLR to set the learning rate decay.

CLR was proposed by Smith [11]. It changes the learning rate periodically during the iterative process instead of keeping it at a fixed value. It is used to find a suitable learning rate automatically instead of through manual experiments. We use CLR to obtain the learning rate setting. The CLR method requires three parameters: the minimum learning rate (min_lr), the maximum learning rate (max_lr), and the iteration. min_lr and max_lr are the smallest and the largest values of the learning rate, respectively. The iteration is the number of test iterations at each learning rate. We increase the learning rate from 0.00001 to 0.001 and obtain the loss curve under different learning rates (see Figure 3).

We choose the appropriate learning rate according to maximum absolute slope criterion. According to Figure 3, we select 0.0002 and 0.00015 as the learning rates of the Oxford-102 dataset and 0.0001 and 0.00008 as the learning rates of the CUB dataset.
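The selection rule can be illustrated on a synthetic loss curve. The sketch below sweeps the learning rate from 0.00001 to 0.001 and returns the rate at which the loss falls fastest, which is how we read the maximum absolute slope criterion here. The log-scale spacing of the sweep, the use of a finite-difference slope with respect to log10 of the learning rate, and the synthetic loss values are all assumptions for illustration; they are not the curves of Figure 3.

```python
import numpy as np

def lr_sweep(min_lr, max_lr, num_steps):
    """Learning rates spaced evenly on a log scale from min_lr to max_lr."""
    return np.geomspace(min_lr, max_lr, num_steps)

def pick_lr_by_steepest_descent(lrs, losses):
    """Return the learning rate where the loss decreases fastest w.r.t. log10(lr)."""
    lrs = np.asarray(lrs, dtype=float)
    slopes = np.gradient(np.asarray(losses, dtype=float), np.log10(lrs))
    return float(lrs[int(np.argmin(slopes))])

# Synthetic loss curve whose steepest drop is placed at lr = 1e-4 by construction.
lrs = lr_sweep(1e-5, 1e-3, 21)
losses = 1.0 / (1.0 + np.exp(10.0 * (np.log10(lrs) + 4.0)))
print(pick_lr_by_steepest_descent(lrs, losses))
```

In a real range test the losses would come from a few training iterations at each learning rate, and the curve is usually smoothed before the slope is taken.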

MultiStepLR is a learning rate decay method in PyTorch. It has three hyperparameters: the initial learning rate ($lr$), the epoch at which the learning rate is updated ($milestone$), and the multiplication factor ($\gamma$). $lr$ is the initial learning rate during training. $milestone$ is the epoch at which we change the learning rate. $\gamma$ is the decay coefficient of the learning rate. In an experiment using MultiStepLR, the initial learning rate is $lr$. When the experiment has run for $milestone$ epochs, the learning rate is changed to $lr \times \gamma$.

In this paper, we set the MultiStepLR hyperparameters $lr$, $milestone$, and $\gamma$ to 0.0001, 600, and 0.8 on CUB and to 0.0002, 600, and 0.75 on Oxford-102. The batch size in our experiments is 64. The optimizer of CPGAN is Adam [37] with a momentum of 0.5.
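The decay rule is simple enough to reproduce without a deep learning framework. The sketch below mirrors the behavior of PyTorch's `torch.optim.lr_scheduler.MultiStepLR`, which multiplies the learning rate by `gamma` each time a milestone epoch is reached; the helper name `multistep_lr` is hypothetical, and the settings are the ones listed above.

```python
def multistep_lr(initial_lr, milestones, gamma, epoch):
    """Learning rate at `epoch`: initial_lr * gamma ** (number of milestones reached)."""
    reached = sum(1 for m in milestones if epoch >= m)
    return initial_lr * gamma ** reached

# Settings reported for the two datasets.
for name, lr0, gamma in [("CUB", 0.0001, 0.8), ("Oxford-102", 0.0002, 0.75)]:
    before = multistep_lr(lr0, [600], gamma, epoch=599)
    after = multistep_lr(lr0, [600], gamma, epoch=600)
    print(name, before, after)
```

With these settings the learning rate after epoch 600 is 0.00008 on CUB and 0.00015 on Oxford-102, which coincides with the second learning rate listed for each dataset in the selection step above.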

4.3. Comparison with Original Model

In CP decomposition, the rank determines the compression ratio, and it is hard to select. Because of the requirements of the text-to-image synthesis task, we design the lightweight model on the premise of preserving the quality of generated images. We run extensive experiments to balance performance and compression ratio.

As shown in Table 1, we run a large number of experiments to find this balance. The ratio is the rank ratio, where 1.0 is full-rank decomposition and 0.9 means that the rank is about 0.9 times the number of input channels of the original layer. A layer with a smaller rank has fewer parameters. Table 1 shows that flops and parameters grow as the rank increases. When the rank ratio is close to 0.7, the parameters begin to exceed those of the original model. As the rank increases, the quality of the generated images does not improve greatly. FID decreases first and then changes only slightly as the model parameters increase, while IS is not stable. A possible reason is that the calculation of IS relies on the marginal distribution of the data, but the generated samples in Oxford-102 are not enough to estimate the marginal distribution accurately.

As shown in Table 1, FID reaches its best value when the rank ratio is 0.5. At this setting, the model is compressed by about 23% in parameters and 29% in flops, and the generated images are better than those of the original model on both FID and IS. This shows that our method can generate better images with fewer parameters than the original model, and that it is effective to use CP decomposition to reconstruct the model and design a compact text-to-image GAN without loss of image quality. Although flops and parameters are reduced, the images generated by CPGAN improve slightly on IS and FID, which suggests that a model with more parameters does not necessarily generate better images. We therefore search around a rank ratio of 0.5 for a better model that preserves the quality of generated images.

Table 2 compares our best generative model with the original model on IS, FID, parameters, and flops. On Oxford-102, the FID and IS of the original model are 79.55 and 2.66 ± 0.03, while those of our best model are 74.40 and 3.68 ± 0.08, respectively. On CUB, the images generated by our best model reach 65.94 on FID and 5.03 ± 0.07 on IS, while those of the original model reach 68.79 and 2.88 ± 0.04, respectively. Representative images for the Oxford-102 and CUB datasets are compared in Figures 4 and 5, respectively. The better images generated by CPGAN indicate that our proposed method can generate more realistic images from text descriptions. These results also indicate that there are redundant parameters in existing text-to-image GAN and that a more concise and efficient text-to-image GAN model can be designed based on CPD.

5. Conclusions

In this paper, we propose CPGAN, a simple and efficient architecture based on CPD. CPGAN substantially reduces the parameters and flops of the original model while also improving the quality of generated images. In designing CPGAN, we replace each convolution layer with three small CP-decomposed layers to achieve compression. To stabilize the training process, we introduce conditioning augmentation to reduce the instability caused by the sparsity of text embeddings. To further improve the end-to-end training of our model, the idea of the autoencoder is integrated into the model: each decomposed layer is regarded as an encoder layer and is paired with an added decoder layer, and the decoder layers are removed after training. Experiments demonstrate that CPGAN reduces parameters by about 23% and flops by about 29% with a slight improvement in generated image quality on Oxford-102. The main convolution layers of the generator in other text-to-image GAN models are similar to those of GAN-int-cls. We have also decomposed similar convolution layers in other GAN models, and the results were similar to those for GAN-int-cls, so CPD is applicable to other cross-modal GANs. In the existing methods, the rank is set manually, which is time-consuming. Therefore, the automatic selection of rank may be a research direction in the future.

Data Availability

The datasets used in this paper are public datasets which can be accessed via the following websites: http://www.vision.caltech.edu/visipedia/CUB-200-2011.html and https://www.robots.ox.ac.uk/~vgg/data/flowers/102/

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.