Introduction

In the last decade, the use of digital images has grown tremendously. Images are corrupted by noise during acquisition, compression, and transmission, with environmental conditions, sensor limitations, and transmission channels being common sources. In image processing, image noise is a random variation in the signal that alters the brightness or color of pixels and hinders observation and information extraction. Noise adversely affects downstream tasks such as video processing, image analysis, and segmentation, and can lead to wrong diagnoses [1]. Hence, image denoising is a fundamental step that underpins most image processing pipelines.

Due to the increasing number of digital images captured under poor conditions, image denoising has become an essential tool for computer-aided analysis, and restoring a clean image from noisy observations is a problem of pressing importance. Image denoising procedures remove noise and restore a clean image. A major difficulty is distinguishing between noise, edges, and texture, since all of them have high-frequency components. The noise types most discussed in the literature are additive white Gaussian noise (AWGN) [2], impulse noise [3], quantization noise [4], Poisson noise [5], and speckle noise [6]. AWGN arises in analog circuitry, while impulse, speckle, Poisson, and quantization noise arise from faulty manufacturing, bit errors, and inadequate photon counts [7]. Image denoising methods are used in medical imaging, remote sensing, military surveillance, biometrics and forensics, and industrial and agricultural automation. In medical and biomedical imaging, denoising algorithms are fundamental pre-processing steps used to remove noise such as speckle, Rician, and quantum noise [8, 9]. In remote sensing, denoising algorithms are used to remove salt-and-pepper and additive white Gaussian noise [10, 11]. Synthetic aperture radar (SAR) images support spaceborne and airborne operations in military surveillance [12], and image denoising algorithms have helped reduce speckle in SAR images [13]. Forensic images are not associated with a specific kind of noise and can be corrupted by any type, which degrades the quality of evidence; denoising methods have helped suppress such noise [14]. Image denoising has also been used to filter paddy leaf images for rice plant disease detection. Undoubtedly, image denoising is an active area of research that cuts across many fields.

Linear, non-linear, and non-adaptive filters were the first filters used for image applications [15]. Noise reduction filters fall into six categories: linear, non-linear, adaptive, wavelet-based, partial differential equation (PDE), and total variation filters. Linear filters compute each output pixel as a weighted combination of neighboring input pixels (a matrix multiplication procedure) to reduce noise. Non-linear filters suppress noise while preserving edge information, so in most filtering applications they are used in place of linear filters, which blur edges and are therefore considered poor filtering methods. A simple example of a non-linear filter is the median filter (MF) [16]. Adaptive filters employ statistical components for real-time applications (least mean square [17] and recursive mean square [18] are examples). Wavelet-based filters transform images to the wavelet domain and are used to reduce additive noise [19, 20]. A detailed review of different denoising filters is available in references [21, 22].

Most of the above-mentioned filters produce reasonably good results; however, they have drawbacks, including poor test-phase optimization, manual parameter settings, and reliance on models tailored to a specific noise type. Fortunately, the flexibility of convolutional neural networks (CNNs) has shown the ability to overcome these drawbacks [23]. CNN algorithms have shown a strong ability to solve many problems [24]; for example, CNNs have achieved excellent results in image recognition [25], robotics [26], self-driving vehicles [27], facial expression recognition [28], natural language processing [29], handwritten digit recognition [30], and many other areas. Chiang and Sullivan [31] were the first to use a CNN (deep learning) for image denoising tasks: a neural network with weighting factors was used to remove complex noise, and a feedforward network [32] later provided a balance between efficiency and denoising performance. In the early development of CNNs, vanishing gradients, saturating activation functions (sigmoid [33] and Tanh [34]), and lack of hardware support made CNNs difficult to train. However, the development of AlexNet [35] in 2012 changed this, and more CNN architectures (such as VGG [36] and GoogleNet [37]) have since been applied to computer vision tasks. References [38, 39] describe the first CNN architectures used in image denoising. Zhang et al. [40] proposed the denoising CNN (DnCNN) for image denoising, super-resolution, and JPEG deblocking; the network consists of convolutions, batch normalization [41], rectified linear units (ReLU) [42], and residual learning [43].
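To make the DnCNN-style design concrete, the following is a minimal PyTorch sketch of a residual denoiser that stacks Conv, BN, and ReLU layers and predicts the noise map rather than the clean image. The depth and channel width shown are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DnCNNSketch(nn.Module):
    """DnCNN-style denoiser: Conv+ReLU, (depth-2) x Conv+BN+ReLU, Conv.
    The body predicts the noise (residual); the clean image is obtained
    by subtracting the prediction from the noisy input."""
    def __init__(self, channels=1, features=64, depth=17):
        super().__init__()
        layers = [nn.Conv2d(channels, features, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(features, features, 3, padding=1, bias=False),
                       nn.BatchNorm2d(features),
                       nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(features, channels, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, noisy):
        residual = self.body(noisy)   # estimated noise map
        return noisy - residual       # residual learning: clean = noisy - noise

# usage sketch: denoised = DnCNNSketch()(torch.randn(1, 1, 64, 64))
```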

The use of CNNs is not limited to general image denoising alone; CNNs have also produced excellent results for blind denoising [44], real noisy images [45], and many other settings. Although several researchers have developed CNN methods for image denoising, only a few reviews summarize these methods. Reference [46] summarized CNN denoising methods with categories based on noise type. Although that review is elaborate, it does not consider methods for specific image types, nor does it cover recent work (methods published in 2020); hence, several studies published in late 2020 were omitted. Our review provides an overview of CNN image denoising methods for different kinds of noise, including noise in specific image types. We discuss state-of-the-art methods with emphasis on image type and noise specification. The outline of CNN image denoising methods is depicted in Fig. 1. It is hoped that the explanations in this study will provide an understanding of the CNN architectures used in image denoising. Our contributions are summarized as follows:

  1. Analysis of different CNN image denoising models, databases, and image types.

  2. A highlight of commonly used objective evaluation methods in CNN image denoising.

  3. Potential challenges and road maps in CNN image denoising.

Fig. 1 CNN image denoising scheme

The rest of the paper is organized as follows. In Sect. 2, we review different CNN image denoising methods. In Sect. 3, we review databases for CNN image denoising algorithms. Section 4 gives an analysis of CNN image denoising; finally, the paper is concluded in Sect. 5.

Literature review

In this section, several existing methods for CNN image denoising are discussed. We divide CNN image denoising approaches into two: (1) CNN denoising for general images, and (2) CNN denoising for specific images. The first approach uses CNN architectures to denoise general images, while the second uses CNNs to denoise specific images; the first approach is more widely used in denoising applications than the second. General images are images that serve a general purpose rather than a specialized one (see [47] for samples of general images). Specific images are images created for a particular purpose or domain; medical images, infrared images, and remote sensing images are examples. The reason for dividing CNN denoising by image category is to bring readers up to speed with the latest CNN architectures with regard to image types. A block diagram depicting the different approaches is shown in Fig. 1.

CNN denoising for general images

Reference [48] proposed the attention-guided CNN (ADNet) for image denoising. ADNet consists of 17 layers organized into 4 blocks: a sparse block (SB), a feature enhancement block (FEB), an attention block (AB), and a reconstruction block (RB). Sparsity has been shown to be effective for image applications [49]; hence, the SB was used to improve efficiency and performance and to reduce the depth of the denoising framework. The SB has 12 layers of two types (dilated Conv + BN + ReLU, and Conv + BN + ReLU). The FEB has 4 layers of 3 types (Conv + BN + ReLU, Conv, and Tanh), while the AB is a single convolution layer. The AB guides the SB and FEB, which is useful when the noise is unknown. Finally, the RB reconstructs the clean image. The mean square error [50] was used as the loss function for training (see Fig. 2).

Fig. 2 Attention-guided denoising CNN [48]

Some deep learning algorithms produce excellent results on synthetic noise; however, most of these networks do not perform well on images corrupted by realistic noise. Guo et al. [51] proposed the noise estimation removal network (NERNet) to reduce realistic noise. The architecture is divided into two modules: a noise estimation module and a noise removal module. The noise estimation module estimates the noise-level map using a symmetric dilated block [52, 53] and pyramid feature fusion [54], while the removal module uses the estimated noise-level map to remove the noise. Global and local information for preserving details and texture is aggregated in the removal module. The output of the noise estimation module is passed to the removal module to produce clean images.
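The two-module idea (estimate a per-pixel noise-level map, then condition the removal network on it) can be sketched as follows in PyTorch. The module internals here are simplified placeholders, not the published NERNet layers (the symmetric dilated blocks and pyramid feature fusion are omitted).

```python
import torch
import torch.nn as nn

class NoiseEstimator(nn.Module):
    # Predicts a non-negative per-pixel noise-level map from the noisy input.
    def __init__(self, ch=3, feat=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(feat, 1, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.net(x)

class RemovalModule(nn.Module):
    # Denoises the image conditioned on the estimated noise-level map.
    def __init__(self, ch=3, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch + 1, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, ch, 3, padding=1))

    def forward(self, x, level_map):
        # Residual prediction conditioned on the concatenated noise-level map.
        return x - self.net(torch.cat([x, level_map], dim=1))

noisy = torch.randn(1, 3, 64, 64)
clean_est = RemovalModule()(noisy, NoiseEstimator()(noisy))
```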

CNNs undeniably learn noise patterns and image patches effectively; however, this learning requires a large amount of training data and image patches. For this reason, reference [55] proposed the patch complexity local divide and deep conquer network (PCLDCNet). The network was divided into local subtasks (according to clean image patches and conquer blocks), each trained on its own local space, and each noisy patch was combined with the local subtasks through a weighting mixture. Image patches were grouped by complexity [56], while the training of the k networks was achieved with modified stacked denoising autoencoders [57]. Network degradation is another problem in deep networks (the deeper the network, the higher the error rate). Although the introduction of ResNet [58] addressed this issue, there is still room for improvement. Shi et al. [59] proposed a hierarchical residual learning network that does not require identity mapping for image denoising. The network has 3 sub-networks: feature extraction, inference, and fusion. The feature extraction sub-network extracts patches representing higher-dimensional feature maps. The inference sub-network [60] contains cascaded convolutions that produce a large receptive field; the cascading learns noise maps from multiscale information and tolerates errors in noise estimation. Finally, the fusion sub-network fuses all the noise maps to produce the final estimate.

Gai and Bao [61] used an improved CNN (MP-DCNN) for image denoising. MP-DCNN is an end-to-end adaptive residual CNN constructed for modeling noisy images. Noise in the input image was extracted with leaky ReLU activations, and the image features were reconstructed. An initial denoised image was fed into SegNet to obtain edge information. The MSE and a perceptual loss function [62] were combined to obtain the final denoised image (see Fig. 3).

Fig. 3 MP-DCNN [61]
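MP-DCNN combines a pixel-wise MSE term with a perceptual loss computed on pretrained network features. Below is a hedged sketch of such a combined loss using torchvision's VGG16 feature extractor; the layer cut-off and loss weight are illustrative assumptions, not the settings used in [61, 62].

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen VGG16 feature extractor used only to compare feature maps.
# Note: VGG expects 3-channel inputs (replicate channels for grayscale).
vgg_feats = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()
for p in vgg_feats.parameters():
    p.requires_grad_(False)

def denoising_loss(denoised, clean, perceptual_weight=0.1):
    """Pixel-wise MSE plus a VGG-feature (perceptual) term."""
    mse = F.mse_loss(denoised, clean)
    perceptual = F.mse_loss(vgg_feats(denoised), vgg_feats(clean))
    return mse + perceptual_weight * perceptual
```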

Reference [63] proposed a new dictionary learning model based on a mixture of Gaussians (MOG) distribution, formulated within the expectation–maximization framework [64]. A minimization problem using sparse coding and dictionary updating was adopted, with quantitative and visual comparisons. To learn hierarchical mapping functions and to prevent vanishing-gradient problems, Zhang et al. [65] proposed the separation aggregation network (SANet). SANet uses three blocks (a convolutional separation block, a deep mapping block, and a band aggregation block) to remove noise. The convolutional separation block decomposes the noisy input into sub-bands [66, 67]; each band is then mapped to a clean, latent form using convolution and ReLU layers; finally, the band aggregation block concatenates all the maps and convolves the features to produce the output. The SANet model was inspired by the non-local patch (NPL) model [67], which consists of patch grouping, transformation, and patch aggregation. Residual images obtained by learning the difference between noisy and clean image pairs can lose information that is important for producing an effective noise-free output. Reference [68] proposed the detail retaining CNN (DRCNN) to map between noisy and clean pairs without losing such information. DRCNN focuses on the integrity of high-frequency image content and yields better generalization ability; a minimization problem derived from the detail loss function was analyzed, designed, and solved. DRCNN has two modules: a generalization module (GM) and a detail retaining module (DRM). The GM contains convolution layers with a stride of 1, while the DRM contains several convolution layers. Unlike many architectures, DRCNN does not use BN.

Computational cost is an emerging problem in CNN applications: very large networks occupy a large memory footprint and require high computational capacity, making them unsuitable for smart and portable devices. To address this, Yin et al. [69] proposed a side window CNN (SW-CNN) for image filtering. SW-CNN has two parts: side kernel convolution (SKC), and fusion and regression (FR). SKC aligns the side or corner of the operation window with the target pixel to preserve edges, and is combined with the CNN to provide effective representation power. A residual learning strategy [70] was adopted to map layers. FR involves two convolutional phases consisting of three operations: pattern expression, non-linear mapping, and weight calculation. Pattern expression calculates gradients from the feature map tensor to produce a pattern tensor; non-linear mapping convolves the pattern tensor with different kernels to produce a tensor of dimension H × W × D; finally, weight calculation generates the weighting coefficient of each pixel.

Removing a single noise type with a CNN is already difficult; removing mixed noise from an image is harder still, and most mixed noise removal algorithms require pre-processing of outliers. Reference [71] proposed the denoising-based generative adversarial network (DeGAN) for removing mixed noise from images. The generative adversarial network (GAN) [72] has been widely used in deep learning applications. DeGAN comprises a generator, a discriminator, and a feature extractor network. The generator uses the U-Net [73] architecture, while the discriminator consists of 10 end-to-end layers; the main purpose of the discriminator is to check whether the image estimated by the generator (U-Net) is noise free. Finally, the feature extraction network uses VGG19 [74] to extract features and assist model training by contributing to the loss function (see Fig. 4).

Fig. 4 DeGAN [71]
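To illustrate the generator/discriminator interplay described above, the following is a minimal sketch of one adversarial training step for a denoising GAN. The losses and weights are generic assumptions; the exact objectives and feature-extractor term used in [71] may differ.

```python
import torch
import torch.nn.functional as F

def degan_style_step(generator, discriminator, g_opt, d_opt, noisy, clean, adv_weight=1e-3):
    """One adversarial training step for a denoising GAN (sketch only)."""
    # Update discriminator: clean images are 'real', denoised outputs are 'fake'.
    with torch.no_grad():
        fake = generator(noisy)
    d_real = discriminator(clean)
    d_fake = discriminator(fake)
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Update generator: fool the discriminator while staying close to the clean image.
    fake = generator(noisy)
    d_fake = discriminator(fake)
    adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    g_loss = F.mse_loss(fake, clean) + adv_weight * adv
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```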

Xu et al. [75] proposed Bayesian deep matrix factorization (BDMF) for multiple-image denoising. BDMF uses a deep neural network (DNN) to model low-rank components and is optimized via stochastic gradient variational Bayes [76,77,78]; the network combines a deep matrix factorization (DMF) network with the Bayesian method. Synthetic and hyperspectral images were used for evaluation. Reference [79] proposed a classifier/regression CNN for image denoising, where the classifier network detects impulse noise and the regression network restores the noisy pixels identified by the classifier. The classifier network involves convolution, BN, ReLU, a softmax, and a skip connection. Based on the labels predicted by the classifier, the regression network uses four layers and a skip connection to predict the clean image (see Fig. 5).

Fig. 5 Classifier/regression CNN [79]
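The detect-then-restore idea behind the classifier/regression design can be sketched as two small networks: one predicts a per-pixel corruption probability, the other restores only the pixels judged corrupted. The layer counts and the mask-blending step are illustrative assumptions, not the exact design of [79].

```python
import torch
import torch.nn as nn

class ImpulseClassifier(nn.Module):
    # Per-pixel two-class prediction: index 1 = impulse-corrupted, 0 = clean.
    def __init__(self, ch=1, feat=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, feat, 3, padding=1), nn.BatchNorm2d(feat), nn.ReLU(inplace=True),
            nn.Conv2d(feat, 2, 3, padding=1))

    def forward(self, x):
        return self.net(x).softmax(dim=1)[:, 1:2]   # probability of corruption

class PixelRegressor(nn.Module):
    # Restores pixel values, guided by the predicted corruption mask.
    def __init__(self, ch=1, feat=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch + 1, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, ch, 3, padding=1))

    def forward(self, x, mask):
        restored = self.net(torch.cat([x, mask], dim=1))
        return mask * restored + (1 - mask) * x     # keep pixels judged clean

noisy = torch.rand(1, 1, 64, 64)
denoised = PixelRegressor()(noisy, ImpulseClassifier()(noisy))
```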

Reference [80] proposed a complex-valued CNN (CDNet) for image denoising. The input image is first passed through 24 sequentially connected convolutional units (SCCU), each involving a complex-valued (CV) convolutional layer, CV ReLU, and CV BN, with 64 convolutional kernels used in the network. Residual blocks are applied to the middle 18 units. Convolution/deconvolution layers with a stride of 2 improve computational efficiency, and a final merging layer transforms the complex-valued features into a real-valued image. Overall, CDNet has five building blocks: CV Conv, CV ReLU, CV BN, CV residual block (RB), and the merging layer (see Fig. 6).

Fig. 6 Complex-valued CNN [80]
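A complex-valued convolution is commonly implemented with two real-valued convolutions, one for the real kernel and one for the imaginary kernel. The sketch below shows this standard construction; it is a generic illustration of complex-valued layers, not the specific CDNet implementation.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex convolution from two real convolutions:
    (W_r + i W_i) * (x_r + i x_i) = (W_r*x_r - W_i*x_i) + i(W_r*x_i + W_i*x_r)."""
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, k, padding=padding)
        self.conv_i = nn.Conv2d(in_ch, out_ch, k, padding=padding)

    def forward(self, x_r, x_i):
        real = self.conv_r(x_r) - self.conv_i(x_i)
        imag = self.conv_r(x_i) + self.conv_i(x_r)
        return real, imag

# A merging layer could map (real, imag) back to a real image,
# e.g. via the magnitude torch.sqrt(real**2 + imag**2) (illustrative choice).
```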

Zhang et al. [81] proposed a detection-and-reconstruction CNN for color (3-channel) images. The method has three networks: a classifier network, a denoiser network, and a reconstruction network. The classifier network predicts, per color channel, the probability that a pixel is corrupted by impulse noise; a decision-maker procedure that computes a label vector for each pixel determines whether a color pixel is noisy or noise free. Corrupted channels are replaced in a sparse clean image (0 for noise free), and the denoised image is then reconstructed by the image reconstruction architecture. In a nutshell, the classifier network (convolution and ReLU layers) predicts channel probabilities, the denoiser network (convolution, BN, and ReLU layers) recovers the noise-free color pixels, and the reconstruction network (convolutions only) reconstructs the image. Although the networks share the same structure, their depth and number of nodes differ. Adaptive moment estimation (Adam) [82] was used to optimize the networks.

Reference [83] proposed the CNN variation model (CNN-VM) for image denoising. The CNN used in this research, termed EdgeNet, consists of multiple-scale residual blocks (MSRB). EdgeNet extracts features from the noisy image through an edge regularization method, and total variation regularization is used to obtain superior performance on sharp edges. The Bregman splitting method is used to solve the model. Each MSRB employs two kernels for each bypass to detect local features; one skip connection feeds the input data forward to generate output features, and another skip connection after each MSRB block, together with a bottleneck layer, fuses the detected features. Four MSRB blocks were adopted in the EdgeNet training procedure. A comparison of the different methods in this section is available in Tables 1 and 2.

Table 1 Comparison of CNN denoising methods for general images
Table 2 Advantages and disadvantages of CNN denoising methods for general images

CNN denoising for specific images

Islam et al. [84] proposed a feedforward CNN method to remove mixed (Gaussian–impulse) noise. The method adopts a computationally efficient transfer learning approach for noise removal. The model consists of a pre-processing stage and four convolutional filtering stages. A rank-order filtering operation is applied at each stage, and each convolution layer precedes the ReLU and max-pooling layers: the output of the first stage is fed into a ReLU whose output is max-pooled, the second and third stages use convolution and ReLU layers, and the last stage uses only a convolution layer. A back-propagation algorithm (with a differentiable and traceable loss function) was used to train the model, and data augmentation [85, 86] was used for effective learning. Another study, by Tian et al. [87], proposed a deep learning method based on U-Net [73] and the Noise2Noise [88] approach. First, the noise was validated on computer-generated holography (CGH) images; then the classical Gerchberg–Saxton (GS) algorithm [89] was used to generate different (two-phase) holograms; next, the noise reduction mechanism (U-Net and Noise2Noise) was applied. The MSE was used as the loss function and the learning rate was set to 0.001. As in the previous method, MSE was adopted as the loss function; evidently, MSE can serve as a good loss function in image denoising.
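The Noise2Noise idea referenced above trains the denoiser on pairs of independently noisy observations of the same scene, so no clean ground truth is needed. A minimal sketch of one such training step is shown below; it illustrates the general principle of [88], not the exact training setup of [87].

```python
import torch
import torch.nn.functional as F

def noise2noise_step(model, optimizer, noisy_input, noisy_target):
    """One Noise2Noise-style step: both input and target are noisy
    realizations of the same scene; MSE is minimized between them."""
    optimizer.zero_grad()
    loss = F.mse_loss(model(noisy_input), noisy_target)
    loss.backward()
    optimizer.step()
    return loss.item()
```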

Reference [90] proposed the spectral–spatial denoising residual network (SSDRN). SSDRN uses a spectral difference mapping method [91] based on a CNN with residual learning; it is an end-to-end algorithm that preserves the spectral profile while removing noise. A key band is selected based on a principal transform matrix and denoised with DnCNN [40]. Overall, SSDRN involves three parts: spectral difference learning, key band selection, and the denoising (DnCNN) model. Unlike many CNN denoising models, SSDRN uses a batch normalization [92] layer in each block of the algorithm. Reference [93] proposed patch-group deep learning for image denoising: a training set of patch groups was created, and a deep learning method [94, 95] was then used to reduce the noise. Reference [96] developed an end-to-end deep neural network (DDANet) for computational ghost image reconstruction. DDANet uses a bucket signal with multiple tunable noise-level maps; a clear image is output after training with simulated bucket signals and ground-truth images. DDANet has 21 layers, including fully connected layers, dense blocks, and convolution layers. Input, transformation, noise adding [97], encoding [98], and object recovery layers are used in the DDANet architecture, along with a skip connection [99, 100] for passing high-frequency feature information. An attention gate (AG) [101] and dilated convolutions filter the features. Finally, a dropout layer [102] is used to avoid overfitting, while BN accelerates the convergence of the loss function.

Zhang et al. [103] proposed the deep spatio-spectral Bayesian posterior network (DSSBPNet) for hyperspectral images. DSSBPNet blends a Bayesian variational posterior with a deep neural network and is divided into two parts: a deep spatio-spectral (DSS) network and the Bayesian posterior. The DSS network splits the input image into three parts, producing a spatio-spectral gradient [104] for each part, and uses several different convolutions. The likelihood of the original data, the noise estimate, the noise distribution, and the sparse noise gradient constitute the Bayesian posterior method. A forward–backward propagation method connects the DSS network with the Bayesian posterior. Reference [105] proposed a two-stage cascaded residual CNN to remove mixed noise from infrared images. The model uses a mixed convolutional layer combining dilated, sub-pixel, and standard convolutions to extract features and improve accuracy. A residual learning method estimates the calibration parameters from the input image. Five feature extraction blocks (FEBs) use a coarse–fine convolution unit (CF-Conv) and a spatial-channel noise attention unit (SCNAU) to stack noise features. The last convolution layer of each network consists of a single filter (see Fig. 7). Giannatou et al. [106] proposed a residual learning CNN (SEMD) for noise removal in scanning electron microscopy images. SEMD is a residual learning method inspired by DnCNN and trained to estimate the noise at each pixel of a noisy image; its input block consists of a convolutional layer followed by ReLU and BN, and its output block consists of a convolution with one filter for reconstruction. Jiang et al. [107] proposed a generative adversarial network for denoising underwater images (UDnNet). UDnNet consists of two sub-networks: a generator and a discriminator. The generator produces realistic samples using the training procedure, an asymmetric codec structure, and a skip connection; its output is processed by convolution–instance norm–leaky ReLU layers, and a deconvolution–instance norm–leaky ReLU stage decodes the features.

Fig. 7 Two-stage cascaded residual CNN [105]

Reference [108] combined a bilateral filter, hybrid optimization, and a CNN to remove noise. The bilateral filter [100, 109] removes noise, while the hybrid optimization uses a swarm intelligence strategy [110] to preserve edges. Finally, a CNN classifier (with convolution layers, a pooling layer with feature extraction, and a fully connected layer) classifies the image. For evaluation, the peak signal-to-noise ratio, vector root mean square error, structural similarity index, and root mean square error were adopted [8, 111]. A major challenge when using CNNs for speckle reduction is labeling: ultrasound images are not labeled, which makes it very difficult for deep learning to identify speckle. Feng et al. [112] proposed a hybrid CNN method for speckle reduction that combines three sources of knowledge. First, since speckle noise is approximately Gaussian-distributed in the logarithm transform domain, the distribution parameters were estimated in that domain with maximum likelihood estimation. Second, a transferable denoising network was trained on a clean natural image dataset. Finally, VGGNet was used to extract structural boundaries from the trained images. Overall, the transferable denoising network was trained on Gaussian prior knowledge of clean ultrasound images, and the pre-trained network was then fine-tuned with prior knowledge of structural boundaries. Ultrasound images (breast, liver, and spinal) and artificially generated phantom (AGP) images were used for evaluation.
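The log-domain observation above underlies a common homomorphic despeckling recipe: take the logarithm so multiplicative speckle becomes approximately additive Gaussian noise, apply any Gaussian denoiser, and exponentiate back. A minimal sketch follows; `gaussian_denoiser` is a hypothetical user-supplied function (for example, a CNN trained on AWGN).

```python
import numpy as np

def despeckle_log_domain(image, gaussian_denoiser, eps=1e-6):
    """Homomorphic despeckling sketch: multiplicative speckle is roughly
    additive (near-Gaussian) in the log domain, so a Gaussian denoiser
    can be applied there and the result mapped back."""
    log_img = np.log(image.astype(np.float64) + eps)   # multiplicative -> additive
    log_denoised = gaussian_denoiser(log_img)          # remove ~Gaussian noise
    return np.exp(log_denoised) - eps                  # back to intensity domain
```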

Reference [113] used a pre-trained residual learning network (RLN) for despeckling ultrasound images. The model consists of a noise model and a pre-trained RLN. A noise model was created from the training dataset, random patches were generated from the speckled images, and the RLN was trained on these patches to produce despeckled images. The pre-trained RLN has 59 layers (consisting of Conv, ReLU, and BN) for training and testing. The method was tested on artificial and natural images corrupted with speckle noise (see Fig. 8).

Fig. 8 Pre-trained RLN [113]

Kim and Lee [114] proposed a conditional generative adversarial network (CGAN) for noise reduction in low-dose chest images. The CGAN involves a generative model [115], a discriminator model [116], and a prediction model. The generator has 14 layers and focuses on synthesizing realistic images from random noise vectors, while the discriminator has 4 layers and is trained on ground-truth images. A tensor library was used to implement the CGAN architecture. Li et al. [117] proposed a progressive network learning strategy (PNLS) that fits the Rician distribution with large convolutional filters. The network consists of two residual blocks (used for fitting the pixel domain and matching pixel domains). The first residual block uses Conv and ReLU layers without BN, while the second uses Conv, ReLU, and BN layers. Each block has 5 layers, with three convolution layers acting as an intermediary between the blocks (see Fig. 9).

Fig. 9 Progressive network learning strategy [117]

Reference [118] proposed a novel CNN method for denoising MRI scans (CNN-DMRI). The network uses convolutions to separate image features from noise. CNN-DMRI is an encoder–decoder structure that preserves important features and ignores unimportant ones; the network learns prior features from the image domain and produces clean images. Down-sampling and up-sampling factors of 2 were adopted. CNN-DMRI is a four-layer network: the first two layers have 64 filters followed by convolution layers, and the down-sampling layer has 128 filters followed by 4 residual blocks and a 64-filter up-sampling layer. Finally, the noisy image is concatenated with the network output to produce the clean MRI. A comparison of the different methods in this section is available in Tables 3 and 4.

Table 3 Comparison of CNN denoising methods for specific images
Table 4 Advantages and disadvantages of CNN denoising methods for specific images

CNN image denoising performance measures

Performance evaluation is a key component of image denoising. Over the years, researchers have used different objective evaluation methods for CNN image denoising; the most commonly adopted ones are described below.

The mean square error (MSE) is the average of the squared difference between the original image I and the denoised image L over the N pixels. Lower MSE values signify better image quality.

$$ {\text{MSE}} = \frac{1}{N}\left| {\left| {I - L} \right|} \right|^{2} . $$
(1)

Peak signal-to-noise ratio (PSNR) is derived from the MSE: it measures the ratio between the maximum possible signal value and the MSE, expressed in decibels. Higher PSNR values signify better image quality.

$$ {\text{PSNR}} = 10{\text{*log}}_{{10}} \left( {\frac{{\left( {\max \left( I \right)} \right)^{2} }}{{MSE}}} \right).{\text{~}} $$
(2)
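Equations (1) and (2) translate directly into code. The snippet below is a straightforward NumPy implementation; images are assumed to be arrays on a 0–255 scale (adjust `max_value` for other ranges).

```python
import numpy as np

def mse(original, denoised):
    # Mean squared error over all pixels (Eq. 1), images as float arrays.
    return np.mean((original.astype(np.float64) - denoised.astype(np.float64)) ** 2)

def psnr(original, denoised, max_value=255.0):
    # Peak signal-to-noise ratio in dB (Eq. 2); max_value is the peak intensity.
    return 10.0 * np.log10(max_value ** 2 / mse(original, denoised))
```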

The structural similarity index measure (SSIM) measures the perceptual difference (in luminance, contrast, and structure) between two similar images. Higher SSIM values signify better image quality.

$$ {\text{SSIM}}\left( {I,L} \right) = \frac{{\left( {2\mu _{I} \mu _{L} + Q_{1} } \right)\left( {2\sigma _{{IL}} + Q_{2} } \right)}}{{\left( {\mu _{I} ^{2} + \mu _{L} ^{2} + Q_{1} } \right)\left( {\sigma _{I} ^{2} + \sigma _{L} ^{2} + Q_{2} } \right)}}, $$
(3)

where \(\mu _{I}\) and \(\mu _{L}\) are the average gray values, \(\sigma _{I}^{2}\) and \(\sigma _{L}^{2}\) are the variances of the patches, \(\sigma _{{IL}}\) is the covariance of I and L, and Q1 and Q2 denote two small positive constants (typically on the order of 0.01).
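In practice, SSIM (and PSNR) are rarely reimplemented by hand; scikit-image provides reference implementations. The snippet below uses placeholder arrays for the clean and denoised images; `data_range` is the dynamic range of the data.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

clean = np.random.rand(128, 128)                     # placeholder clean image in [0, 1]
denoised = clean + 0.01 * np.random.randn(128, 128)  # placeholder denoiser output

ssim_value = structural_similarity(clean, denoised, data_range=1.0)
psnr_value = peak_signal_noise_ratio(clean, denoised, data_range=1.0)
```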

Root mean square error (RMSE) measures the difference between estimated predictions and actual observed values; it is the square root of the MSE. The RMSE between two m × n images P and Q is:

$$ {\text{RMSE}} = \sqrt {{\text{MSE}}\left( {P,Q} \right)} = \sqrt {\frac{1}{{mn}}\mathop \sum \limits_{{p = 1}}^{m} \mathop \sum \limits_{{q = 1}}^{n} \left( {P_{{pq}} - Q_{{pq}} } \right)^{2} } . $$
(4)

Feature similarity (FSIM and FSIMc) is designed for gray-scale images and the luminance components of color images (FSIMc additionally incorporates chrominance information). It computes local similarity maps and pools these maps into a single similarity score.

$$ {\text{FSIM}} = ~\frac{{\mathop \sum \nolimits_{{x \in \Omega }} S_{L} \left( x \right).PC_{m} ~\left( x \right)}}{{\mathop \sum \nolimits_{{x \in \Omega }} PC_{m} ~\left( x \right)}}, $$
(5)
$$ {\text{FSIMc}} = ~\frac{{\mathop \sum \nolimits_{{x \in \Omega }} S_{L} \left( x \right).\left[ {S_{c} \left( x \right)} \right]^{\lambda } .PC_{m} ~\left( x \right)}}{{\mathop \sum \nolimits_{{x \in \Omega }} PC_{m} ~\left( x \right)}}~. $$
(6)

To learn more about the FSIM and FSIMc see reference [119].

The signal-to-noise ratio (SNR) measures the noise level relative to the original signal as follows.

$$ {\text{SNR}} = 10\log _{{10}} \frac{{\left\| L \right\|^{2} }}{{\left\| {I - L} \right\|^{2} }}. $$
(7)

The spectral angle mapper (SAM) and the Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) [120] are used alongside other evaluation methods for remote sensing images. Overall, PSNR and SSIM are the most widely used evaluation methods for CNN denoising; they are popular because they are easy to compute and are considered well tested and valid [121].

Datasets

This section provides a list of datasets used for CNN image denoising algorithms. They include: the ImageNet large scale visual recognition challenge object detection (ILSVRC-DET) dataset [122], Places2 [123], the Berkeley Segmentation Dataset (BSD) [124], the Waterloo Exploration Database [125], EMNIST [126], the COCO dataset [127], MIT-Adobe FiveK [128], ImageNet [129], BSD68 [130], Set14 [131], Renoir [132], NC12 [133], NAM [134], SIDD [135], SUN397 [136], Set5 [137], CAVE [138], the Harvard database [139], an MRI brain dataset [140], LIVE1, Chelsea, the DIV2K dataset [141], the Xiongan HSI dataset, the First Affiliated Hospital of Sun Yat-Sen University and Shenzhen Third People's Hospital datasets, artificially generated phantom (AGP) images [142], ultrasound datasets [143, 144], the SPIE–American Association of Physicists in Medicine lung CT challenge database [145], the SIAT-CAS MRI dataset, BrainWeb [146, 147], the IXI dataset [148], a multiple sclerosis dataset [149], a prostate MRI dataset [150], and the Thammasat University Hospital dataset [151]. A few samples of the data used by researchers for CNN denoising are shown in Fig. 10, and Fig. 11 plots the datasets used to evaluate CNN denoising methods. The Berkeley Segmentation Dataset has the highest usage because it is particularly well suited to image denoising research. Three major points that matter when selecting datasets are relevance, usability, and quality [47, 152]; hence, we believe the datasets used most frequently for CNN denoising exhibit these properties. It should be noted that some datasets are not shown in the graph (Fig. 11) because they appear only rarely in CNN denoising research.
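Most of these datasets contain clean images only, so a typical benchmarking protocol corrupts each image with synthetic noise of a known level before denoising and scoring. The sketch below shows this common AWGN setup (a 0–255 intensity scale and a noise level sigma of 25 are illustrative assumptions).

```python
import numpy as np

def add_awgn(image, sigma=25, seed=0):
    """Corrupt a clean image (e.g., from BSD68 or Set14) with additive white
    Gaussian noise of standard deviation sigma on the 0-255 scale; the
    denoiser is then evaluated against the clean image with PSNR/SSIM."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 255.0)
```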

Fig. 10 A few samples of images in datasets used by researchers

Fig. 11 Datasets used for evaluating CNN denoising methods

Analysis

A total of 152 references were included in this paper, of which 31 research papers relate directly to CNN image denoising. A conscious effort was made to include all research articles relating to CNN image denoising; however, some studies might have been missed. A graph depicting the number of CNN image denoising publications per year is shown in Fig. 12. From the figure, it is clear that researchers have only recently adopted CNNs for image denoising, leaving the area open for further experimentation and exploration. Finally, the distribution of image types adopted by the CNN image denoising methods is shown in Fig. 13.

Fig. 12 Number of papers published yearly

Fig. 13 List of image types

Conclusions and future directions

CNN architectures have recently become quite useful in image denoising. We have presented a survey of techniques relating to CNN image denoising, explaining the relevant concepts and methods to give readers a grasp of recent trends and enumerating several CNN denoising techniques. A total of 144 references were included in this paper. From the study, we observed that the GAN was the most used approach for CNN image denoising: several methods used a generator and a discriminator for feature extraction and clean image generation, and some researchers combined the GAN with deep CNN methods. Feedforward CNNs and U-Net were also used. The residual network was used repeatedly by researchers; a reason for its popularity could be its effectiveness and efficiency, since residual connections allowed researchers to limit the number of convolutions in their networks. A creative direction pursued by researchers was the removal of mixed noise (impulse plus Gaussian), which requires several carefully designed deep convolutions. Rician and speckle noise are common in medical images, and pre-trained networks have worked excellently for medical image noise reduction. The Berkeley database was the most used dataset in CNN image denoising. In addition, attention mechanisms and residual networks are commonly used techniques in image denoising tasks; the reason for such wide acceptance is their popularity and effectiveness in image denoising.

Problems confronting CNN image denoising methods include insufficient memory for CNN applications and difficulty in solving unsupervised denoising tasks. Notably, only a few CNN methods have been applied to medical images; it would be encouraging to see more CNN methods applied to medical image denoising. The authors also tried to collect the corresponding code and software, but they were not available. The provision of larger memory allocations for CNN tasks would be very helpful and could be an area for future research.

The findings of the review can be summarized below:

  • From the available literature, it is clear that CNNs can remove many kinds of noise from images and considerably advance denoising capability. Several studies reported higher performance for CNN architectures in image denoising. CNN architectures support end-to-end procedures and can be implemented promptly.

  • CNN architectures can be customized for noise removal tasks, creating patterns that remove the bottleneck of vanishing gradients.

  • CNN methods are designed using technical knowledge and principles in concert with understanding the noise type and noise models.

  • Most studies used pre-trained CNN models; however, noise properties vary continuously and may require a model built from scratch, which allows readjustment and fine-tuning. Building a model from scratch, however, requires substantial computational resources and time. With the introduction of cloud-based platforms (e.g., Colab), it is hoped that the problems of space and time will be resolved.

  • The use of spatial patterns in CNN architectures could create a shift from conventional methods to deep learning methods. Contrary to the perception that CNNs are black boxes, feature visualization methods provide a trustworthy platform for noise removal; however, the greatest challenges remain computational time and space.