1 Introduction

Compared with grayscale images, color images carry richer detail and clearer information, so in many applications where only grayscale images are produced, the grayscale image is often converted to a color image first. Over the last decade, many studies in various fields have sought to generate color images automatically and effectively [1]. For example, automatic colorization [36] of grayscale satellite images helps eliminate lighting differences between multi-spectral captures, providing strong prior information for ground-type classification and object detection [33]. Synthetic Aperture Radar (SAR) colorization [33] takes advantage of well-established radar polarimetry theories and approaches to interpret the vast majority of SAR images that are not full-pol. A semi-automatic colorization system takes a monochrome manga and reference images as input and generates a plausible color version of the manga [8].

Color medical imaging [29] is used in various fields, such as building virtual human models for educational purposes. Color mapping of the distribution of shear stress along the plaque surface [7] has also been studied over the last decade. However, most medical image datasets are acquired as grayscale images, so it is difficult to obtain enough color images. To solve this problem, a colorization operation that converts monochrome images into color images should be considered.

Colorization models date back to the early 2000s. At first, researchers colorized a grayscale image by matching luminance and texture information between reference color images and the target grayscale image. The colorized results were therefore deterministic and relied heavily on the collection of reference images; this is referred to as a forward problem. We call this kind of method transfer-based [2, 5, 14, 23, 27, 37]. Researchers also proposed scribble-based colorization [8, 11, 22, 24, 31, 35, 38], which follows users’ guidance through manual intervention: users first scribble regions of an image with different colors, and the system then colorizes the image based on the scribbles. This kind of method is useful in some scenarios, but not for medical image colorization. In this paper, we aim at fully automatic colorization, which takes a grayscale image as input and outputs a colorized image directly, without any manual intervention.

A large body of work has addressed fully automatic colorization [3, 10, 15, 19, 28, 32, 34, 42]. Most of it is based on deep neural networks (DNNs) [40, 41] and achieves high-quality image colorization. For example, Larsson et al. [19] proposed an end-to-end, fully automatic image colorization scheme using a deep convolutional architecture that exploits both low-level and semantic representations; at the time of writing, their system achieved state-of-the-art performance in automatically colorizing grayscale images. Zhang et al. [42] reported similar work on fully automatic colorization. Our proposed system also falls into the category of fully automatic colorization using a DNN. However, those previous works required each input and output pair of training examples to be paired and aligned, and obtaining paired training data can be expensive and difficult, especially for medical images. In contrast, our proposed model can be trained on unpaired image data.

With the development of generative adversarial networks (GANs) [21], Chia et al. [5] utilized a GAN to propose style conversion networks that transfer the style of one image to another, which can also be applied to transferring color to a grayscale image or synthesizing a color image from a grayscale image [10, 15, 28, 32]. The CycleGAN [44], in particular, enables learning from unpaired datasets by applying cycle consistency, a significant benefit when a large amount of paired training data is difficult to obtain. Although GANs have been applied to medical image processing for various purposes [39], their application to the colorization of medical images has not been studied. Our work builds on the CycleGAN to colorize medical images and achieves better performance by introducing the perceptual loss function [20] and the total variation (TV) loss function [35].

The contributions of this paper are as follows.

  • We demonstrate that GAN-based style conversion can be applied to colorization of medical images.

  • We demonstrate that introducing perceptual loss and TV loss improves the quality of the colorized images, yielding results better than those generated by the CycleGAN model alone. Our model inherits the characteristic of CycleGAN that the input and output training images need not be paired or aligned, which is a significant advantage when applying colorization to medical images.

This manuscript is organized as follows. In Section 2, we introduce related studies about scribble-based, transfer-based, and fully automatic colorization methods. In Section 3, we introduce our proposed unpaired GAN network and loss functions. Experimental results with a small dataset and a large set of medical images are presented in Section 4. Finally, conclusions are given in Section 5.

2 Related work

Colorization techniques [12] have been widely used in scientific illustrations, medical image processing, and old-photo colorization to enhance features and increase visual appeal. The large degree of freedom during the assignment of color information makes image colorization very challenging. Over the last decade, many approaches have been proposed to address colorization issues. Previous work falls into three broad categories, which are scribble-based [8, 11, 22, 24, 31, 35, 38], transfer-based [2, 5, 14, 23, 27, 37], and fully automatic colorization [19].

Scribble-based methods require users to scribble on the target grayscale image, which is then passed to a color optimization algorithm [22]. The algorithm propagates the scribbled colors on the assumption that adjacent pixels with similar intensity values should have similar colors. The user needs enough knowledge to place appropriate color scribbles in each region, and for ordinary users it can be difficult to provide enough suitable colors to achieve good results. To make the scribbling process easier and to generate better colorization results, Zhang et al. [43] proposed real-time, user-guided image colorization using a DNN. Because the colorization is performed in a single feed-forward pass, it enables real-time use.

Transfer-based methods rely on the availability of reference images whose colors are transferred to the target grayscale image. They work well if there are enough similar reference images, and they require less skill and effort from the user than scribble-based methods. However, a unique grayscale input image sometimes cannot be matched to any similar reference image, which leads to non-ideal results. These methods also require building a large repository of reference images for use at test time. A transfer-based colorization system establishes a mapping between source and target images using correspondence between local descriptors [2, 27, 37], or in combination with the user’s guidance [5, 14].

Fully automatic colorization, in contrast to the above methods, is our goal: grayscale images are input to the system, which outputs the corresponding color images directly, without manual intervention. Fully automatic image colorization systems [3, 19, 34, 42] generate high-quality colorized images owing to the emergence and development of DNN technology. Much work has also applied generative adversarial networks (GANs) [9] to colorization, focusing on improving training stability and enabling robust color image synthesis on large multi-class image datasets [10, 15, 28, 32]. Nazeri et al. [28] applied a GAN to image colorization by redefining the generator cost function proposed in the original GAN model [9], and adopted the idea of the conditional generative adversarial network [26] so that the generator takes grayscale images as input rather than randomly generated noise. Sharma et al. [32] proposed robust image colorization using a self-attention-based progressive GAN, which cascades a residual encoder-decoder network and a self-attention-based progressive generative network. Górriz et al. [10] used a conditional GAN and improved training stability to enable better generalization on large multi-class image datasets. Isola et al. [15] proposed a framework, named pix2pix, which uses a conditional GAN to learn a mapping from input to output images. These works share a common drawback: each input and output pair of training examples must be paired and aligned, and obtaining such paired training data can be expensive and difficult. Zhu et al. [44] proposed the CycleGAN, built on pix2pix, to learn the mapping without paired training examples. Mehir et al. [25] proposed an architecture for colorizing near-infrared images; it learns the color channels from an unpaired dataset, requires less computation time, converges faster, and generates high-quality samples. The ability to use unpaired training data offers significant benefits, given that it is difficult to obtain enough paired training examples, especially for medical images. Our proposal builds on the CycleGAN but achieves better performance by introducing the perceptual loss function and the TV loss function (Table 1).

Table 1 Overview of existing work in three broad categories. Our proposal belongs to unpaired fully automatic colorization

3 Method

We introduce perceptual loss and TV loss into CycleGAN to colorize medical images. On one hand, our method inherits a main advantage of CycleGAN: it can learn to translate an image from a source domain, Igray, to a target domain, Icolor, without paired examples during training. This characteristic is very practical when paired training data are scarce, as is often the case with medical images. On the other hand, the experimental results show that our method outperforms the CycleGAN and Perceptual-based models. In this section, we first present the overall structure of our proposal. We then detail the objective functions: adversarial loss, cycle-consistency loss, perceptual loss, and TV loss. Finally, we present the full objective function of our proposal.

3.1 Network structure

We present the overall structure of the proposed technique in Fig. 1(a). The technique is based on the structure of CycleGAN, which enables learning from unpaired datasets, i.e., the domains Igray and Icolor are unpaired. The loss function of CycleGAN is the sum of the adversarial loss, which judges whether an input image is synthesized, and the cycle-consistency loss between the original image and its reconstruction, which is created by mapping the synthesized image back to the original domain. The parameters of the perceptual loss network in Fig. 1(a) are pre-trained and untrainable. The perceptual loss network generates features for three kinds of data: the first is Igray, the source-domain gray medical image to be colorized; the second is Gcolor(Igray); and the third is Ggray(Gcolor(Igray)), where Ggray is the generator network that produces gray images and Gcolor is the generator network that produces color images (Fig. 2).
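
For concreteness, the following is a minimal PyTorch sketch of how these components might be instantiated. The tiny `make_generator`/`make_discriminator` stand-ins, the choice to represent both domains as 3-channel tensors (gray images replicated across channels), and the use of torchvision's VGG16 as the frozen perceptual-loss network are assumptions of this sketch, not the exact architectures used in the paper.

```python
import torch.nn as nn
from torchvision.models import vgg16

def make_generator():
    # Stand-in for the CycleGAN ResNet generator [44]; two conv layers only,
    # to keep the sketch self-contained.
    return nn.Sequential(
        nn.Conv2d(3, 64, 7, padding=3), nn.ReLU(inplace=True),
        nn.Conv2d(64, 3, 7, padding=3), nn.Tanh())

def make_discriminator():
    # Stand-in for the PatchGAN discriminator used by CycleGAN.
    return nn.Sequential(
        nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(64, 1, 4, padding=1))

G_color, G_gray = make_generator(), make_generator()          # I_gray -> I_color, I_color -> I_gray
D_color, D_gray = make_discriminator(), make_discriminator()

# Frozen VGG16 feature extractor for the perceptual loss ("pre-trained and untrainable").
# torchvision >= 0.13; older versions use vgg16(pretrained=True).
perceptual_net = vgg16(weights="IMAGENET1K_V1").features.eval()
for p in perceptual_net.parameters():
    p.requires_grad = False
```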

Fig. 1

a The overall structure of the proposed technique contains two mapping functions, Gcolor : Igray → Icolor and Ggray : Icolor → Igray, associated with adversarial discriminators Dgray and Dcolor, and a Perceptual Loss Network. With \( \phi_i \), we indicate the feature map obtained by the i-th convolution (after activation) before the max-pooling layer within the VGG16 network. b Forward cycle-consistency loss: Igray → Gcolor(Igray) → Ggray(Gcolor(Igray)) ≈ Igray. c Backward cycle-consistency loss: Icolor → Ggray(Icolor) → Gcolor(Ggray(Icolor)) ≈ Icolor. We adopted (b) and (c) from [44]

Fig. 2

Image samples from the Messidor dataset. a Color samples. b Gray samples

3.2 Objective functions

In this section, we present the details of the loss functions used in our proposed scheme. Those functions are essential, and they significantly affect the performance of the colorization system. There are four kinds of loss functions: adversarial loss, cycle-consistency loss, perceptual loss, and total variation loss, as described below. Finally, we combine these functions to get our full objective function.

Adversarial loss

We adopted the adversarial loss from [9] and applied it to both mapping functions. For the mapping function Gcolor : Igray → Icolor and its discriminator, Dcolor, we express the loss function as

$$ L_{GAN}\left(G_{color}, D_{color}, I_{gray}, I_{color}\right) = E\left[\log D_{color}\left(I_{color}\right)\right] + E\left[\log\left(1 - D_{color}\left(G_{color}\left(I_{gray}\right)\right)\right)\right] $$
(1)

where Gcolor tries to generate images \( \hat{I}_{color} = G_{color}(I_{gray}) \) that look similar to images from domain Icolor, while Dcolor tries to distinguish the generated images from real samples in Icolor. Gcolor aims to minimize this objective against the adversary Dcolor, which tries to maximize it, i.e., \( \min_{G_{color}} \max_{D_{color}} L_{GAN}\left(G_{color}, D_{color}, I_{gray}, I_{color}\right) \). We introduce a similar adversarial loss for the mapping function Ggray : Icolor → Igray and its discriminator, Dgray, i.e., \( \min_{G_{gray}} \max_{D_{gray}} L_{GAN}\left(G_{gray}, D_{gray}, I_{color}, I_{gray}\right) \).
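
As a hedged illustration, the adversarial term of Eq. (1) can be written in PyTorch as binary cross-entropy on the discriminator's logits. Public CycleGAN implementations often substitute a least-squares variant; the sketch below follows the log form of Eq. (1), and the non-saturating generator loss in the second function is the form commonly used in practice.

```python
import torch
import torch.nn.functional as F

def discriminator_adversarial_loss(D, real, fake):
    # E[log D(real)] + E[log(1 - D(fake))], maximized by D; written here as the
    # equivalent binary cross-entropy minimization over the (patch) logits.
    pred_real, pred_fake = D(real), D(fake.detach())
    return (F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real))
            + F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake)))

def generator_adversarial_loss(D, fake):
    # Generator side of the min-max game: push D(fake) toward the "real" label.
    pred_fake = D(fake)
    return F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
```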

Cycle-consistency loss

We apply cycle-consistency loss [44] to transfer images from the Igray domain to the Icolor domain, and vice versa. The advantage of cycle-consistency loss is that Igray and Icolor can be unpaired, which is very practical and useful when labeling or pairing the training data is costly.

$$ L_{cyc}\left(G_{color}, G_{gray}\right) = E\left[\left\Vert G_{gray}\left(G_{color}\left(I_{gray}\right)\right) - I_{gray}\right\Vert_1\right] + E\left[\left\Vert G_{color}\left(G_{gray}\left(I_{color}\right)\right) - I_{color}\right\Vert_1\right] $$
(2)
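
A minimal sketch of Eq. (2), assuming the generators from the sketch in Section 3.1 and batched image tensors:

```python
import torch.nn.functional as F

def cycle_consistency_loss(G_color, G_gray, I_gray, I_color):
    # L1 distance between each image and its reconstruction after a round trip.
    forward_cycle  = F.l1_loss(G_gray(G_color(I_gray)), I_gray)    # I_gray -> color -> gray
    backward_cycle = F.l1_loss(G_color(G_gray(I_color)), I_color)  # I_color -> gray -> color
    return forward_cycle + backward_cycle
```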

Perceptual loss

Ledig et al. [20] used perceptual loss to improve the quality of the resulting image. Perceptual loss [17] introduces a pre-trained VGG16 model as an additional feature extraction network that extracts features of the synthesized image and the original image and then measures the difference between them. In our experiments, the perceptual loss between the reconstructed image and the original image is added to the learning process.

$$ L_{perceptual\_invert} = E\left[\left\Vert \phi\left(G_{color}\left(I_{gray}\right)\right) - \phi\left(I_{gray}\right)\right\Vert_1\right] $$
(3)
$$ L_{perceptual\_cycle} = E\left[\left\Vert \phi\left(G_{gray}\left(G_{color}\left(I_{gray}\right)\right)\right) - \phi\left(I_{gray}\right)\right\Vert_1\right] $$
(4)

where \( \phi \) indicates the feature map obtained by the i-th convolution (after activation) before the max-pooling layer within the VGG16 network.

The final perceptual loss function is

$$ L_{perceptual}\left(G_{color}, G_{gray}\right) = L_{perceptual\_invert} + L_{perceptual\_cycle} $$
(5)
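
The sketch below computes Eqs. (3)-(5) with features taken from a truncated VGG16. The paper does not state which convolution index i is used, so the cut at relu2_2 (the activation just before the second max-pooling layer) and the ImageNet-pretrained weights are assumptions of this sketch.

```python
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen VGG16 slice up to relu2_2 (torchvision >= 0.13; older versions use pretrained=True).
_vgg_slice = vgg16(weights="IMAGENET1K_V1").features[:9].eval()
for p in _vgg_slice.parameters():
    p.requires_grad = False

def phi(x):
    # Feature extractor; gray tensors are assumed to be replicated to 3 channels.
    return _vgg_slice(x)

def perceptual_loss(G_color, G_gray, I_gray):
    colorized     = G_color(I_gray)
    reconstructed = G_gray(colorized)
    l_invert = F.l1_loss(phi(colorized), phi(I_gray))      # Eq. (3)
    l_cycle  = F.l1_loss(phi(reconstructed), phi(I_gray))  # Eq. (4)
    return l_invert + l_cycle                              # Eq. (5)
```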

Total variation loss

TV loss [35], which has a smoothing effect, was added to prevent excessive contrast in the resultant image:

$$ L_{TV}(Z) = \sum_{w,h}\left[\left(Z_{w,h+1} - Z_{w,h}\right)^2 + \left(Z_{w+1,h} - Z_{w,h}\right)^2\right]^{\frac{B}{2}} $$
(6)

where Z = Gcolor(Igray), B = 1.
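
A sketch of Eq. (6) for a batch of images Z with shape (N, C, H, W); the small epsilon is not part of Eq. (6) and is added only to keep the gradient of the square root finite.

```python
def tv_loss(Z, B=1, eps=1e-8):
    dh = (Z[:, :, :, 1:] - Z[:, :, :, :-1]) ** 2   # (Z_{w,h+1} - Z_{w,h})^2
    dv = (Z[:, :, 1:, :] - Z[:, :, :-1, :]) ** 2   # (Z_{w+1,h} - Z_{w,h})^2
    # Crop so both difference maps cover the same pixels, then apply the B/2 exponent.
    return ((dh[:, :, :-1, :] + dv[:, :, :, :-1] + eps) ** (B / 2)).sum()
```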

Full objective function

Hence, we add all loss functions mentioned above and get our final loss function of the proposed technique, as follows:

$$ \begin{aligned} L\left(G_{color}, D_{color}, G_{gray}, D_{gray}\right) = {} & L_{GAN}\left(G_{color}, D_{color}, I_{gray}, I_{color}\right) \\ & + L_{GAN}\left(G_{gray}, D_{gray}, I_{color}, I_{gray}\right) + L_{cyc}\left(G_{color}, G_{gray}\right) \\ & + L_{perceptual}\left(G_{color}, G_{gray}\right) + L_{TV}(Z) \end{aligned} $$
(7)

where Z = Gcolor(Igray).
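
Combining the pieces, a generator-side objective corresponding to Eq. (7) could look as follows; the helper functions are those from the sketches above, and the discriminators are updated separately with their own adversarial terms. Eq. (7) is an unweighted sum, which is what is shown here; practical CycleGAN code usually scales the cycle term, so the equal weighting is itself an assumption carried over from the equation.

```python
def total_generator_loss(G_color, G_gray, D_color, D_gray, I_gray, I_color):
    fake_color = G_color(I_gray)
    fake_gray  = G_gray(I_color)
    return (generator_adversarial_loss(D_color, fake_color)
            + generator_adversarial_loss(D_gray, fake_gray)
            + cycle_consistency_loss(G_color, G_gray, I_gray, I_color)
            + perceptual_loss(G_color, G_gray, I_gray)
            + tv_loss(fake_color))
```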

4 Experiment

4.1 Dataset description and preprocessing

DRIVE

DRIVE [6] contains only 40 digital color retinal images. We divided them into two equal sets, named ColorA and ColorB. We then converted all images in ColorA to grayscale and stored them in another set, named GrayA. We used GrayA and ColorB to train our model. When validating the trained model, we input GrayA to obtain the predicted images, which we compared with the images in ColorA to calculate performance metrics. Because of the small dataset size, GrayA was used for both training and validation, whereas ideally the training set would have no intersection with the test or validation sets. To overcome this disadvantage, we also used another, larger dataset, as follows.

Messidor

The Messidor database [15] contains 1200 digital color retinal images. We first cropped all images to make them square (1488 × 1488) and resized them to 256 × 256. We then shuffled them before dividing them into three sets of 400 images each, named TrainA, TrainB, and ValColor. We converted the images in TrainA to grayscale, and copied all ValColor images, converting the copies into a grayscale dataset named ValGray. We used TrainA and TrainB to train our model, and ValColor and ValGray to calculate performance metrics after training. Note that, while training the model, the inputs Igray and Icolor are unpaired, meaning not only that the colors differ but also that the image textures are not aligned. The training sets TrainA and TrainB had no intersection with the validation sets ValGray and ValColor.
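
The preprocessing described above might be scripted as follows; the directory layout, file extension, and use of PIL are assumptions of this sketch.

```python
import random
from pathlib import Path
from PIL import Image

def preprocess_messidor(src_dir, dst_dir, size=256, seed=0):
    paths = sorted(Path(src_dir).glob("*.tif"))             # 1200 Messidor images (extension assumed)
    random.Random(seed).shuffle(paths)
    splits = {"TrainA": paths[:400], "TrainB": paths[400:800], "ValColor": paths[800:1200]}
    for split, files in splits.items():
        out = Path(dst_dir) / split
        out.mkdir(parents=True, exist_ok=True)
        for p in files:
            img = Image.open(p)
            w, h = img.size
            side = min(w, h)                                 # e.g. 1488 x 1488 square crop
            left, top = (w - side) // 2, (h - side) // 2
            img = img.crop((left, top, left + side, top + side)).resize((size, size))
            if split == "TrainA":
                img = img.convert("L")                       # grayscale source domain
            img.save(out / (p.stem + ".png"))
            if split == "ValColor":                          # grayscale copies form ValGray
                gray_dir = Path(dst_dir) / "ValGray"
                gray_dir.mkdir(parents=True, exist_ok=True)
                img.convert("L").save(gray_dir / (p.stem + ".png"))
```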

4.1.1 Training details

With the DRIVE dataset, we chose epoch num = 200 and used torch.optim.lr_scheduler.CosineAnnealingLR from PyTorch as the learning rate scheduler, which reduces the learning rate over the training epochs until it reaches zero. These decisions were based on the following experiments. The DRIVE dataset contains only 40 images, so we needed to guard against overfitting while training the model. First, we set batch size = 1, epoch num = 2000, and a constant learning rate of 0.002, which produced Fig. 3(c) and (d); they show that the values of the loss functions were very unstable over training and that, beyond roughly 400 epochs, the loss values had almost reached their lowest point. Second, we set batch size = 1 and epoch num = 200 with the cosine-annealing learning rate, which produced Fig. 3(a) and (b); they show that the loss values became much more stable as training approached 200 epochs. We also manually checked the predicted (generated) images after training and found that the model trained for 2000 epochs generates color images with additional, strange components, indicating overfitting. Therefore, we used the settings mentioned at the beginning of this subsection for all experiments with the DRIVE dataset.

Fig. 3

With the DRIVE dataset, the training loss over time with respect to each loss function inside the corresponding model. a, b 200 epochs of training the proposed model (cosine-annealing learning rate), c, d 2000 epochs of training the proposed model (constant learning rate)

On the Messidor dataset, we used Adam [18] with a learning rate of 2 × 10−3 and a cosine-annealing scheduler that reduces the learning rate over the training epochs until it reaches zero. We resized all images to 256 × 256 and trained the networks with batch size = 2 (200 iterations per epoch) for 50 epochs over the training data. We did not use weight decay or dropout. Our implementation used PyTorch [30] and cuDNN [4], and training took roughly 1.5 h on a single GTX 1080 Ti GPU.
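
A sketch of the optimizer and scheduler setup described above; `train_loader`, the Adam betas, and the reuse of `total_generator_loss` from Section 3 are assumptions of this sketch, and discriminator updates are omitted for brevity.

```python
import itertools
import torch

params    = itertools.chain(G_color.parameters(), G_gray.parameters())
optimizer = torch.optim.Adam(params, lr=2e-3, betas=(0.5, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=0.0)

for epoch in range(50):
    for I_gray, I_color in train_loader:        # unpaired batches of size 2 from TrainA / TrainB
        optimizer.zero_grad()
        total_generator_loss(G_color, G_gray, D_color, D_gray, I_gray, I_color).backward()
        optimizer.step()
    scheduler.step()                            # learning rate anneals toward zero by epoch 50
```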

4.2 Results

4.2.1 Performance results on the DRIVE dataset

The experimental results in Table 2 compare the performance of the CycleGAN and Perceptual models against our proposal. For each model, we ran the experiment three times; each time, we used 20 ground-truth images and the corresponding generated images to calculate the NRMSE, PSNR, and SSIM metrics. Lower NRMSE scores and higher PSNR and SSIM scores indicate better performance. Table 2 shows the mean values and standard deviations of these metrics. CycleGAN applies GAN loss, cycle loss, and identity loss; the source code of its PyTorch implementation can be found in [13]. We added perceptual loss to CycleGAN, keeping all other conditions unchanged, to obtain the second method in Table 2. The experimental results show that the Perceptual method performs better than CycleGAN. We further added perceptual loss plus TV loss to CycleGAN to obtain our proposal, and the results show that its performance is better than the other two.
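
The metrics can be computed, for example, with scikit-image; the paper does not name the library it used, so the functions below (and `channel_axis=2` for color SSIM, available in scikit-image >= 0.19) are assumptions of this sketch.

```python
import numpy as np
from skimage.metrics import (normalized_root_mse, peak_signal_noise_ratio,
                             structural_similarity)

def evaluate(pred_images, gt_images):
    # Each image is an (H, W, 3) uint8 array; returns (mean, std) per metric.
    nrmse = [normalized_root_mse(gt, pr) for gt, pr in zip(gt_images, pred_images)]
    psnr  = [peak_signal_noise_ratio(gt, pr) for gt, pr in zip(gt_images, pred_images)]
    ssim  = [structural_similarity(gt, pr, channel_axis=2)
             for gt, pr in zip(gt_images, pred_images)]
    return {name: (float(np.mean(v)), float(np.std(v)))
            for name, v in [("NRMSE", nrmse), ("PSNR", psnr), ("SSIM", ssim)]}
```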

Table 2 A DRIVE dataset comparison of mean and standard deviations from NRMSE, PSNR, and SSIM

4.2.2 Performance results with the Messidor dataset

Table 3 shows the NRMSE, PSNR, and SSIM performance metrics of the CycleGAN and Perceptual methods against our proposed model on the Messidor dataset, indicating that our method performs better than the other two. We chose batch size = 2, epoch num = 50, and an initial learning rate of 0.002, using torch.optim.lr_scheduler.CosineAnnealingLR from PyTorch, which reduces the learning rate over the training epochs until it reaches zero. Figure 4 shows the values of the loss functions over the training time; they are not particularly stable, and around epoch 50 the loss values hardly improved at all. While training the models, we saved them at the end of each epoch, so after training we could choose the epoch that colorizes images best without overfitting.

Table 3 With the Messidor dataset, a comparison of means and standard deviations of NRMSE, PSNR, and SSIM. We added perceptual loss to CycleGAN to obtain the Perceptual method (third row), and further added TV loss to the Perceptual method to obtain our proposal
Fig. 4

With the Messidor dataset, values of loss functions of each model over the training time. a CycleGAN loss over time, b Perceptual loss over time, c Our proposal’s loss over time

We present samples of the colorized results in Fig. 5. The first column contains the gray input images to be colorized by the models. Columns 2 to 4 are the colorized results from the CycleGAN model, the Perceptual model, and our proposal, respectively. The final column contains the ground-truth images. We trained the three models under the same conditions, including the same hyperparameters, and used the trained models from epoch 50 to generate the color images shown in Fig. 5. Two obvious shortcomings can be seen in the second column, generated by CycleGAN. One is the rings, indicated by arrows, which arise because the size of the retina varies across the training images. The other is that the retinal texture is blurred and the overall color differs considerably from the ground truth. The third column of Fig. 5 shows the results of the Perceptual model: the retinal texture is clear and the overall color resembles the ground truth, although the model still generates faint, barely perceptible rings, which is nevertheless much better than CycleGAN. It also generates some strange components that do not appear in the training images. Moreover, zooming in on the bottom image of the third column shows that the area indicated by the arrow is not as smooth as in the ground truth and our proposal. Overall, the results generated by our proposed method are clearly better with regard to image color, retinal texture, and smoothness.

Fig. 5

From the Messidor dataset, colorization results of the CycleGAN, Perceptual methods, and the proposed method. a Gray input, b CycleGAN, c Perceptual, d Proposal, e Ground truth

4.2.3 Effect of identity loss

The implementation of CycleGAN applies an identity loss [16], even though the CycleGAN paper does not mention it; specifically, \( L_{idt\_gray} = \Vert G_{gray}(I_{gray}) - I_{gray}\Vert_1 \) and \( L_{idt\_color} = \Vert G_{color}(I_{color}) - I_{color}\Vert_1 \). The identity loss requires that if the input of Gcolor is already a color image, its output should be a color image identical to the input (generally, the job of Gcolor is to transform gray images into color images); the same holds for Ggray when its input is already a gray image. We performed experiments to verify the performance contribution of the identity loss. Removing the identity loss from our proposed method, we ran the experiment three times and measured the NRMSE, PSNR, and SSIM metrics shown in Table 4. The results show that No Identity Loss is not better than With Identity Loss, although the metrics are close. We further manually inspected the generated images, with samples shown in Fig. 6. We found that the images produced without identity loss are fragile, tending to contain strange areas that do not exist in the ground-truth images; examples are in the second column, indicated by the arrows. Intuitively, this is because the identity loss forces the generated images to stay close to the input images.
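
For reference, the identity terms above amount to the following sketch: L1 losses on images passed through the generator of their own domain.

```python
import torch.nn.functional as F

def identity_loss(G_color, G_gray, I_gray, I_color):
    # L_idt_gray = ||G_gray(I_gray) - I_gray||_1,  L_idt_color = ||G_color(I_color) - I_color||_1
    return F.l1_loss(G_gray(I_gray), I_gray) + F.l1_loss(G_color(I_color), I_color)
```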

Table 4 With the Messidor dataset, a comparison of mean and standard deviations from NRMSE, PSNR, and SSIM
Fig. 6

The colorized results: with and without identity loss

4.2.4 Symmetric and asymmetric structures of perceptual loss

The structure of CycleGAN is symmetric, and our proposal builds upon it by introducing perceptual loss and TV loss. A natural question is therefore how the performance changes if the perceptual loss is also made symmetric. Specifically, we define a symmetric perceptual loss with Eq. (8) below. Furthermore, we add the asymmetric term Lperceptual_invert of Eq. (3) to Eq. (8) to obtain Eq. (9).

$$ \begin{aligned} L_{perceptual\_sym}\left(G_{color}, G_{gray}\right) & = L_{perceptual\_cycle} + E\left[\left\Vert \phi\left(G_{color}\left(G_{gray}\left(I_{color}\right)\right)\right) - \phi\left(I_{color}\right)\right\Vert_1\right] \\ & = E\left[\left\Vert \phi\left(G_{gray}\left(G_{color}\left(I_{gray}\right)\right)\right) - \phi\left(I_{gray}\right)\right\Vert_1\right] + E\left[\left\Vert \phi\left(G_{color}\left(G_{gray}\left(I_{color}\right)\right)\right) - \phi\left(I_{color}\right)\right\Vert_1\right] \end{aligned} $$
(8)
$$ L_{perceptual\_sym\_2}\left(G_{color}, G_{gray}\right) = L_{perceptual\_sym}\left(G_{color}, G_{gray}\right) + L_{perceptual\_invert} $$
(9)

Three options can thus serve as the perceptual loss in our models: Eqs. (5), (8), and (9). We performed experiments to compare them; Table 5 shows the NRMSE, PSNR, and SSIM results. The performance is close, but our proposal is the best. Additionally, we generated color images with a model trained using Lperceptual_sym and manually checked the quality of the colorized images, finding that most of them contain redundant spots that do not exist in the corresponding ground-truth images. Figure 7 shows samples of these images, with the abnormal areas indicated by arrows. Although the performance of Lperceptual_sym_2 and Lperceptual is close according to Table 5, Occam's razor suggests not using more than is necessary; Lperceptual is therefore preferable because it is simpler.

Table 5 With the Messidor dataset, a comparison of mean and standard deviations from NRMSE, PSNR, and SSIM
Fig. 7

Colorized results of Symmetric Perceptual Loss. The gray images are the input, while the color images are the results of the model using Symmetric Perceptual Loss

5 Conclusion

A DNN model for colorization is data-driven, and training its millions of parameters requires a large amount of labeled or paired image training data. In this paper, we focused on two problems: 1) the colorized medical image should preserve the features of the input image, and 2) it is difficult and expensive to obtain large amounts of paired medical image training data. We demonstrated that our model can use unpaired medical images to train a colorization system that synthesizes high-quality color medical images. Our model is built on CycleGAN and introduces perceptual loss and TV loss. It has almost the same training and testing complexity as CycleGAN, yet the experimental results show that it performs better at colorizing medical images.