1 Introduction

Image fusion technology, which aims to combine images obtained from different sensors into a single, information-rich fused image [1], has been widely used in medical imaging [2, 3], remote sensing [4,5,6], object recognition [7, 8], and detection [9]. Among the combinations of different image types, infrared and visible image fusion has attracted increasing attention [10]. Infrared images record the thermal radiation of the scene, so targets in infrared images are prominent and obvious. However, infrared images have little detail information, low contrast, poor visual effects, and poor imaging performance. In contrast, visible images provide abundant detail information, but the target can be inconspicuous and is easily influenced by smoke, bad weather, and other factors. Therefore, fusing the two types of images can compensate for the insufficient imaging competence of infrared and visible sensors [11]. The fused image can possess clearer scene information as well as better target characteristics [12].

There are seven main categories of fusion methods: multi-scale geometric analysis (MGA)-based, sparse representation-based [13,14,15], neural network-based [16, 17], subspace-based [18], saliency-based [19], hybrid [20], and other methods. Among them, MGA-based methods are the most popular. MGA-based methods assume that an image can be represented by coefficients at different scales. These methods decompose the source images into low- and high-frequency sub-bands, combine the corresponding sub-bands with specific fusion rules, and reconstruct the fused image with the inverse MGA transform [21]. The key to MGA-based methods is the choice of MGA transform, which determines how much useful information can be extracted from the source images and integrated into the fused image. Popular transforms for decomposition and reconstruction include the wavelet transform (WT) [22], wedgelet transform [23], curvelet transform [24, 25], contourlet transform [26], non-subsampled contourlet transform (NSCT) [27, 28], shearlet transform (ST) [29], non-subsampled shearlet transform (NSST) [30], and so on. Owing to its shift invariance, high sensitivity, strong directivity, fast operation, and multi-directional processing, NSST has been widely used in image fusion [31]. Many studies have shown that NSST is more consistent with human visual characteristics than other MGA transforms, so it can give fused images better visual effects [32]. However, a single transform may be inappropriate for infrared and visible image fusion. In infrared images, the target information is significant and easy to detect and recognize, whereas in visible images the detailed information is mainly carried by gradients. Adopting the same representation for the two types of images therefore renders the thermal radiation target inconspicuous, so that it can hardly be distinguished from the background. In MGA-based fusion methods, it is difficult to preserve the thermal radiation of infrared images and the appearance information of visible images simultaneously.

To overcome this problem, we propose a new fusion algorithm based on nonlinear enhancement and NSST decomposition for infrared and visible images. Firstly, NSST is used to decompose the two source images into low- and high-frequency sub-bands. Then, the high-frequency sub-bands are fused with a WT-based method. To highlight the target, we construct a nonlinear transform function to determine the fusion weight of the low-frequency sub-bands, whose parameter can be adjusted to meet different fusion requirements. Finally, the inverse NSST is used to reconstruct the fused image. Experiments demonstrate that the proposed method can not only enhance the thermal target of the infrared image but also preserve the texture details of the visible image. The presented method is competitive with or even superior to other methods in terms of both visual and quantitative evaluations.

The rest of this paper is organized as follows. The theoretical basis and implementation steps of NSST are reviewed in Section 2. The details of the proposed image fusion method are presented in Section 3. Experimental results and comparisons are reported in Section 4. The main conclusions are drawn in Section 5.

2 Related works

NSST is one of the most suitable multi-scale geometric analysis tools for fusion applications. It provides a sparse image representation that captures edges and much detail information, and it introduces no artifacts or noise when the inverse NSST is performed. In addition, the shearlet coefficients form well-localized tight frames over various locations and scales with anisotropic orientations. These properties support a successful fusion process, producing higher image quality and clearer image details and edges [33].

2.1 Basic principle of NSST

The shearlet construction is based on non-subsampled pyramid filter banks, which provide the multi-scale decomposition, and on directional filtering generated by a shear matrix, which provides multi-directional localization. For dimension n = 2, the affine system with composite dilations is AAB(ψ) [10]:

$$ {A}_{AB}\left(\psi \right)=\left\{{\psi}_{j,l,k}(x)={\left|\det A\right|}^{j/2}\psi \left({B}^l{A}^jx-k\right):j,l\in Z,k\in {Z}^2\right\} $$
(1)

where ψ ∈ L2(R2), A and B are 2 × 2 invertible matrices, and |det B| = 1. If AAB(ψ) forms a Parseval frame (a tight frame) for L2(R2), the elements of the system are called composite wavelets. For any f ∈ L2(R2), we have

$$ {\sum}_{j,k,l}{\left|<f,{\psi}_{j,l,k}>\right|}^2={\left\Vert f\right\Vert}^2 $$
(2)

Among them, the matrices Aj and Bl are associated with scale and geometric transformations (such as rotation and shear operations), respectively.

With \( {A}_a=\left(\begin{array}{cc}a& 0\\ {}0& \sqrt{a}\end{array}\right) \) and \( {B}_s=\left(\begin{array}{cc}1& s\\ {}0& 1\end{array}\right) \), the system can be written as follows:

$$ \left\{{\psi}_{ast}(x)={a}^{-\frac{3}{4}}\psi \left({A}_a^{-1}{B}_s^{-1}\left(x-t\right)\right),a\in {R}^{+},s\in R,t\in {R}^2\right\} $$
(3)

Equation (3) is a shearlet system, and ψast(x) is a shearlet.

Figure 1 shows the tiling of the frequency plane induced by the shearlets and the frequency supports of the shearlet elements. It can be seen from Fig. 1 that each element \( {\hat{\psi}}_{j,l,k} \) is supported on a pair of trapezoids of size about \( {2}^j\times {2}^{2j} \), oriented along a line with a slope of \( l{2}^{-j} \).

Fig. 1 The tiling of the frequency plane induced by the shearlets and the frequency support of shearlet elements

2.2 Implementation steps

The NSST can be realized through two steps:

(1) Multi-scale decomposition. The non-subsampled pyramid (NSP) filter bank decomposes the source image into a set of high- and low-frequency sub-images to attain multi-resolution decomposition. Firstly, the source image is decomposed into low- and high-frequency coefficients with the NSP. Then, the NSP decomposition at each level iterates on the low-frequency component obtained from the previous level, so as to capture point singularities. Without any down-sampling operation, the sub-band images have the same size as the source image. Finally, a j-level decomposition yields one low-pass image and j band-pass images.

(2) Directional localization. The shearlet filter bank decomposes the high-frequency sub-images to attain multi-directional decomposition. Firstly, the pseudo-polar coordinates are mapped to Cartesian coordinates. Then, the “Meyer” wavelet is used to construct the window function and generate the shearlet filters. Finally, each sub-band image is convolved with the “Meyer” window function to obtain the directional sub-band images.

The two-level decomposition structure is shown in Fig. 2. The NSP decomposes the source image f into a low-pass filtered image \( {f}_a^1 \) and a high-pass filtered image \( {f}_d^1 \). In each iteration, the NSP decomposes the low-pass filtered image from the previous level until the specified number of decomposition levels is reached. Finally, one low-pass (low-frequency) image and a series of high-frequency images are obtained.
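
To make the multi-scale step concrete, the following is a minimal à trous-style sketch, not the actual NSST implementation: a fixed binomial low-pass kernel stands in for the “maxflat” NSP filters used later in Section 4, band-pass images are taken as differences of successive low-pass images, and the directional filtering of step (2) is omitted.

```python
import numpy as np
from scipy.ndimage import convolve

def _dilate_kernel(kernel, step):
    # Insert (step - 1) zeros between taps: filtering with the dilated
    # kernel replaces down-sampling, so every sub-band keeps full size.
    n = kernel.shape[0]
    k = np.zeros(((n - 1) * step + 1,) * 2)
    k[::step, ::step] = kernel
    return k

def nsp_decompose(image, levels=2):
    """A trous-style stand-in for the non-subsampled pyramid (NSP)."""
    k1 = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    base = np.outer(k1, k1)                  # separable binomial low-pass
    low, highs = image.astype(np.float64), []
    for j in range(levels):                  # iterate on the low-pass image
        nxt = convolve(low, _dilate_kernel(base, 2 ** j), mode='nearest')
        highs.append(low - nxt)              # band-pass = difference of lows
        low = nxt
    return low, highs                        # 1 low-pass + `levels` band-pass
```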

Fig. 2 Two-level decomposition diagram of NSST

3 Proposed method

In this section, we introduce the process of the proposed method and discuss the parameter setting. The low- and high-frequency components obtained from the NSST decomposition carry different feature information: the low-frequency components carry the approximate features of the source image, while the high-frequency components carry the detailed features. The approximate parts provide more visually significant information and contrast information; the detailed parts provide more contour and edge information. Therefore, different fusion rules should be used for the low- and high-frequency components. According to the stage of the image data to be fused and the degree of information extraction in the fusion system, image fusion is divided into three levels: pixel level, feature level, and decision level. The proposed method works at the pixel level. The specific fusion scheme is shown in Fig. 3, and the steps of the proposed method are as follows (a high-level code sketch follows the list):

  • Step 1: Decompose the infrared and visible images with NSST into low- and high-frequency coefficients.

  • Step 2: Fuse low-frequency coefficients based on nonlinear enhancement algorithm.

  • Step 3: Fuse high-frequency coefficients based on WT-based method.

  • Step 4: Apply inverse NSST to obtain the fused image.
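
The four steps can be condensed into a short sketch. This is a minimal outline, not a full implementation: `nsst_decompose` and `nsst_reconstruct` are hypothetical stand-ins for an NSST toolbox, while `fuse_low_frequency` and `fuse_high_frequency` are sketched in Sections 3.1 and 3.2.

```python
def fuse_images(ir, vis, lam=10.0):
    # Step 1: NSST decomposition of both source images
    # (nsst_decompose is a hypothetical stand-in for an NSST toolbox)
    low_ir, highs_ir = nsst_decompose(ir, levels=3, directions=(4, 4, 4))
    low_vis, highs_vis = nsst_decompose(vis, levels=3, directions=(4, 4, 4))

    # Step 2: nonlinear-enhancement fusion of the low-frequency sub-bands
    low_f = fuse_low_frequency(low_ir, low_vis, lam)   # see Section 3.1

    # Step 3: WT-based fusion of each pair of high-frequency sub-bands
    highs_f = [fuse_high_frequency(h_ir, h_vis)        # see Section 3.2
               for h_ir, h_vis in zip(highs_ir, highs_vis)]

    # Step 4: inverse NSST reconstruction of the fused image
    return nsst_reconstruct(low_f, highs_f)
```

The decomposition parameters here mirror those used in Section 4.3.1 (three levels, {4,4,4} directions, “maxflat” pyramid filter).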

Fig. 3 The diagram of the fusion scheme

3.1 Low-frequency sub-band fusion

The low-frequency components reflect the contour information of the image and contain most of the energy of the original image [34]. The weighted average method is commonly used to fuse low-frequency sub-bands; however, unreasonable fusion weights cause loss of source image information or poor image quality. To address these problems, we introduce a fusion strategy that constructs a nonlinear transform function to determine the fusion weight of the low-frequency sub-bands.

In infrared images, the target information is significant. Due to its large gray values, the target is easy to detect and recognize. To highlight the target in the fused image, we use the coefficients of the low-frequency component of the infrared image to determine the low-frequency fusion weight.

Each coefficient of the low-frequency component is taken in absolute value:

$$ R=\left|{\mathrm{LFC}}_{\mathrm{IR}}\right| $$
(4)

where LFCIR represents the low-frequency sub-band of the decomposed infrared image and R represents the significant infrared characteristic distribution. Rmean denotes the mean value of R. When R is larger than Rmean, the coefficient can be considered a bright point; when R is smaller than Rmean, it can be considered a dark point. Bright points are regarded as the target, while dark points are regarded as background. To highlight the target, a nonlinear transform function is introduced to control the degree of enhancement. The nonlinear transform function is as follows:

$$ S\left(\lambda \right)=1-\frac{1}{1+{\left(\frac{R}{R_{\mathrm{mean}}}\right)}^{\lambda }} $$
(5)

where the parameter λ belongs to (0, ∞).

The low-frequency fusion weights can be expressed as:

$$ {C}_{\mathrm{IR}}=S\left(\lambda \right) $$
(6)
$$ {C}_{\mathrm{VIS}}=1-{C}_{\mathrm{IR}} $$
(7)

where CIR is the fusion weight of the infrared image and CVIS is that of the visible image; both belong to [0, 1].

As shown in Eqs. (5)–(7), the parameter λ directly affects the fusion weight of the infrared image. Therefore, λ can be adjusted to control the proportion of infrared features in the fused image. In particular, the larger the value of CIR, the more obvious the target is. To strengthen the thermal radiation target, CIR should be relatively large.

The final low-frequency sub-band fusion result can be obtained as follows:

$$ \mathrm{LFC}\_\mathrm{F}={C}_{\mathrm{IR}}\ast {\mathrm{LFC}}_{\mathrm{IR}}+{C}_{\mathrm{VIS}}\ast {\mathrm{LFC}}_{\mathrm{VIS}} $$
(8)

where LFC _ F represents the low-frequency component of the fused image and LFCVIS represents the low-frequency component of the decomposed visible image.
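
Equations (4)–(8) translate directly into a few NumPy lines. The sketch below assumes the two low-frequency sub-bands are same-sized 2-D arrays; the logistic (sigmoid) rewriting of Eq. (5) and the small eps are our numerical-stability additions, not part of the paper’s formulation.

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid, numerically stable

def fuse_low_frequency(lfc_ir, lfc_vis, lam=10.0, eps=1e-12):
    """Nonlinear-enhancement fusion of low-frequency sub-bands (Eqs. 4-8)."""
    r = np.abs(lfc_ir)                             # Eq. (4): R = |LFC_IR|
    # Eq. (5) rewritten: 1 - 1/(1 + x**lam) = sigmoid(lam * ln x), x = R/R_mean;
    # the sigmoid form avoids overflow for large lam (eps avoids log(0))
    c_ir = expit(lam * np.log(r / r.mean() + eps))  # Eq. (6): C_IR = S(lam)
    c_vis = 1.0 - c_ir                              # Eq. (7)
    return c_ir * lfc_ir + c_vis * lfc_vis          # Eq. (8)
```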

3.2 High-frequency sub-band fusion

High-frequency components reflect detailed information such as the edges and contours of the source image. To retain more detail, we use a WT-based method to fuse the high-frequency sub-bands of the infrared and visible images. Firstly, the WT is used to decompose each high-frequency sub-band into an approximation sub-band (LFCIR and LFCVIS) and directional detail sub-bands (HFCIR and HFCVIS). Here, the Haar wavelet is selected as the WT basis, and the number of decomposition levels is set to 1. Then, the “average” fusion rule is applied to the approximation sub-bands. The approximation sub-band fusion rule is defined as follows:

$$ \mathrm{LFC}\_\mathrm{F}=\frac{\left|{\mathrm{LFC}}_{\mathrm{IR}}\right|+\left|{\mathrm{LFC}}_{\mathrm{VIS}}\right|}{2} $$
(9)

The “max-absolute” fusion rule is applied to the directional detail sub-bands. The directional detail sub-band fusion rule can be expressed as follows:

$$ \mathrm{HFC}\_\mathrm{F}=\left\{\begin{array}{l}{\mathrm{HFC}}_{\mathrm{IR}}\kern0.48em ,\left|{\mathrm{HFC}}_{\mathrm{IR}}\right|>\left|{\mathrm{HFC}}_{\mathrm{VIS}}\right|\\ {}{\mathrm{HFC}}_{\mathrm{VIS}},\mathrm{otherwise}\end{array}\right. $$
(10)

where LFC _ F and HFC _ F represent the fused approximation and directional detail sub-bands of the high-frequency sub-band images.

Finally, the inverse WT is applied to LFC _ F and HFC _ F to obtain the high-frequency sub-bands of the fused image.
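
A sketch of this procedure using the PyWavelets package is given below, assuming each high-frequency sub-band is a 2-D array with even dimensions so that the one-level Haar reconstruction restores the original shape.

```python
import numpy as np
import pywt

def fuse_high_frequency(hfc_ir, hfc_vis):
    """WT-based fusion of one pair of high-frequency sub-bands (Eqs. 9-10)."""
    # 1-level Haar decomposition: approximation + (horizontal, vertical, diagonal)
    a_ir, d_ir = pywt.dwt2(hfc_ir, 'haar')
    a_vis, d_vis = pywt.dwt2(hfc_vis, 'haar')

    # Eq. (9): "average" rule on the approximation sub-bands
    a_f = (np.abs(a_ir) + np.abs(a_vis)) / 2.0

    # Eq. (10): "max-absolute" rule on each directional detail sub-band
    max_abs = lambda x, y: np.where(np.abs(x) > np.abs(y), x, y)
    d_f = tuple(max_abs(x, y) for x, y in zip(d_ir, d_vis))

    # Inverse WT gives the fused high-frequency sub-band
    return pywt.idwt2((a_f, d_f), 'haar')
```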

3.3 Analysis of parameter

In the nonlinear enhancement method, one main parameter influences the enhancement performance, namely λ. In this section, we plot the curve of the enhancement weight CIR under different values of λ, as shown in Fig. 4. The intensity of a target pixel in the fused image is determined by the value of CIR: the larger CIR, the more evident the target is.

Fig. 4 The curve under different values of the parameter λ

As shown in Fig. 4, the CIR curve over the abscissa R (the gray level of the pixel) is S-shaped, which shows that target pixels obtain larger enhancement than background pixels. Moreover, the curve becomes steeper as λ increases. Therefore, it is convenient to adjust λ to obtain different fusion results.
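
For a concrete feel of this behavior, take λ = 10 in Eq. (5): a coefficient at twice the mean is weighted almost entirely toward the infrared image, while one at half the mean is almost entirely suppressed:

$$ \frac{R}{R_{\mathrm{mean}}}=2:\kern0.5em S=1-\frac{1}{1+{2}^{10}}=\frac{1024}{1025}\approx 0.999;\kern1em \frac{R}{R_{\mathrm{mean}}}=\frac{1}{2}:\kern0.5em S=1-\frac{1}{1+{2}^{-10}}=\frac{1}{1025}\approx 0.001 $$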

Figure 5 shows the fused images for λ values of 5, 10, 30, 50, 100, and 200. As seen in Fig. 5, the infrared pixel intensity distribution is strengthened as λ increases. However, when λ becomes too large, distortion appears in the fused image. The parameter λ should therefore be set appropriately to meet different fusion requirements; in this paper, λ is set to 10. The proposed algorithm is summarized in Table 1.

Fig. 5 Fused images under different values of the parameter λ. a Infrared image; b visible image; c–h fused images at λ = 5, 10, 30, 50, 100, and 200, respectively. The differences are highlighted in red rectangles

Table 1 Algorithmic module

4 Experimental results and discussion

4.1 Experimental scheme

To evaluate the performance of the proposed algorithm, two groups of simulation experiments were carried out. Firstly, we compare the proposed method with six MGA-based methods; then, we compare it with five other advanced methods; finally, qualitative and quantitative analyses of the experimental results are given. The infrared and visible images to be fused are collected from the TNO Image Fusion Dataset. Our experiments are performed in MATLAB on a computer with a 2.6 GHz Intel Core CPU and 4 GB of memory.

4.2 Fusion quality evaluation

4.2.1 Subjective evaluation

Subjective evaluation methods assess the quality of the fused image according to the evaluator’s own experience and perception. To some extent, this is a relatively simple, direct, fast, and convenient approach, but its low efficiency and poor real-time performance limit practical applications. Table 2 lists the commonly used subjective evaluation criteria.

Table 2 Subjective evaluation criteria

4.2.2 Objective evaluation

According to the subjects being compared, the objective evaluation indicators of image fusion quality can be divided into three categories: the characteristics of the fused image itself, the relationship between the fused image and a standard reference image, and the relationship between the fused image and the source images [10]. We use A, B, and F to denote the infrared, visible, and fused images, respectively, and R to denote the ideal reference image. The five objective evaluation parameters we use are as follows.

(1) Entropy (E)

E directly measures the richness of image information. The larger the E value, the better the fusion effect. The calculation formula is shown in Eq. (11):

$$ E=-\sum \limits_{i=0}^{L-1}{p}_i{\log}_2{p}_i $$
(11)

where L is the total number of gray levels of the image, and pi is the probability of gray value i in the image.

(2) Average gradient (AG)

AG reflects the micro-detail contrast and texture variation in the image. The larger the AG value, the more gradient information the fused image contains. The calculation formula is shown in Eq. (12):

$$ \Delta \overline{G}=\frac{1}{M\times N}\sum \limits_{m=1}^M\sum \limits_{n=1}^N\sqrt{\frac{\Delta {F}_x^2\left(m,n\right)+\Delta {F}_y^2\left(m,n\right)}{2}} $$
(12)

where ΔFx is the difference in the x direction of the fused image F, and ΔFy is the difference in the y direction.

(3) Standard deviation (SD)

SD reflects the distribution of pixel gray values and the contrast of the fused image. It is defined as follows:

$$ \mathrm{SD}=\sqrt{\frac{1}{M\times N}\sum \limits_{m=1}^M\sum \limits_{n=1}^N{\left(F\left(m,n\right)-\overline{F}\right)}^2} $$
(13)

(4) Spatial frequency (SF)

SF reflects the overall activity of the image in the spatial domain and is defined in Eq. (16). The larger the SF, the better the fusion effect.

$$ \mathrm{RF}=\sqrt{\frac{1}{M\times N}\sum \limits_{m=1}^M\sum \limits_{n=1}^N{\left[F\left(m,n\right)-F\left(m,n-1\right)\right]}^2} $$
(14)
$$ \mathrm{CF}=\sqrt{\frac{1}{M\times N}\sum \limits_{m=1}^M\sum \limits_{n=1}^N{\left[F\left(m,n\right)-F\left(m-1,n\right)\right]}^2} $$
(15)
$$ \mathrm{SF}=\sqrt{{\mathrm{RF}}^2+{\mathrm{CF}}^2} $$
(16)

where RF and CF are the row and column frequencies of the image, respectively.

(5) Edge information retention (QAB/F)

QAB/F measures the amount of edge information transferred from the source images to the fused image. QAB/F is defined as follows:

$$ {Q}^{\mathrm{AB}/F}=\frac{\sum_{\forall m,n}\left({Q}_{m,n}^{\mathrm{AF}}{w}_{m,n}^A+{Q}_{m,n}^{\mathrm{BF}}{w}_{m,n}^B\right)}{\sum_{\forall m,n}\left({w}_{m,n}^A+{w}_{m,n}^B\right)} $$
(17)

where wA and wB denote the weights of the importance of the infrared and visible images to the fused image, and QAF and QBF are calculated from the edge information. A large QAB/F means that considerable edge information has been transferred to the fused image; for a perfect fusion result, QAB/F equals 1.
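
For reference, the four reference-free metrics above translate into a few NumPy lines. This is a minimal sketch assuming 8-bit grayscale inputs; QAB/F is omitted because it requires the full Xydeas–Petrović edge-strength and orientation computation.

```python
import numpy as np

def entropy(img):
    # Eq. (11): Shannon entropy over the 256 gray levels
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def average_gradient(img):
    # Eq. (12): mean of the local gradient magnitude
    f = img.astype(np.float64)
    dx = np.diff(f, axis=1)[:-1, :]   # x-direction differences
    dy = np.diff(f, axis=0)[:, :-1]   # y-direction differences (same shape)
    return np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0))

def standard_deviation(img):
    # Eq. (13): spread of gray values around the mean
    return np.std(img.astype(np.float64))

def spatial_frequency(img):
    # Eqs. (14)-(16): row frequency, column frequency, and their norm
    f = img.astype(np.float64)
    rf = np.sqrt(np.mean(np.diff(f, axis=1) ** 2))
    cf = np.sqrt(np.mean(np.diff(f, axis=0) ** 2))
    return np.sqrt(rf ** 2 + cf ** 2)
```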

4.3 Experiments and results

4.3.1 Comparison with MGA-based methods

In the first group of simulation tests, we use the presented method to fuse five typical infrared and visible image pairs from the TNO dataset, namely “Men in front of house,” “Bunker,” “Sandpath,” “Kaptein_1123,” and “barbed_wire_2.” In addition, six MGA-based methods are selected for comparison, including WT [23], TEMST [35], NSST with weighted average [36], NSST with WT [37], NSCT with WT [38], and CURV with WT [39].

The key to MGA-based fusion schemes is the selection of the transform. WT- and CURV-based methods suffer from block artifacts, reduce image contrast, and cannot capture abundant directional information. The NSCT-based method captures the geometry of image edges well, but the number of directions at each level is fixed. In NSST-based methods, the number of directions can be set arbitrarily, so more detailed information can be obtained; however, the more directions, the longer the running time. We replaced the LP with NSST in TEMST as a comparative experiment.

In the proposed method, the pyramid filter for NSST is set to “maxflat,” the NSST decomposition level is set to 3, and the numbers of directions are set to {4,4,4}. The high-frequency sub-bands are decomposed into one level by the WT (with the Haar basis). The results are shown in Fig. 6. The first two rows in Fig. 6 show the infrared and visible images. The seven remaining rows show the fused images of our method, TEMST, NSST with weighted average, WT, NSST with WT, NSCT with WT, and CURV with WT. The subjective and objective evaluation parameters introduced earlier are used to analyze the fusion results.

Fig. 6 Fusion results on five typical infrared and visible image pairs from the TNO dataset. a Men in front of house. b Bunker. c Sandpath. d Kaptein_1123. e barbed_wire_2. From top to bottom: infrared image, visible image, fused images of our method, TEMST, NSST with weighted average, WT, NSST with WT, NSCT with WT, and CURV with WT

The five assessment indicators above (i.e., E, AG, SD, SF, and QAB/F) on the five typical image pairs are shown in Fig. 7. The larger their values, the better the fusion effect.

Fig. 7 Comparison of five evaluation parameters (E, AG, SD, QAB/F, and SF) for the seven methods (our method, TEMST, NSST with weighted average, WT, NSST with WT, NSCT with WT, and CURV with WT) on the five image pairs

4.3.2 Comparison with the state-of-the-art methods

In this part, seven typical infrared and visible image pairs from the TNO dataset (i.e., men in front of house, bunker, soldier_behind_smoke_1, Nato_camp_sequence, Kaptein_1123, lake, and barbed_wire_1) are chosen to evaluate the effectiveness of the proposed method. We compare the proposed method with five other advanced methods: the guided filtering-based weighted average technique (GF) [40], multi-resolution singular value decomposition (MSVD) [41], fourth-order partial differential equations (FPDE) [42], different resolutions via total variation (DRTV) [43], and visual attention saliency guided joint sparse representation (SGJSR) [44].

The fused images are shown in Fig. 8. The values of the five evaluation metrics on the seven infrared and visible image pairs are shown in Fig. 9.

Fig. 8 Qualitative fusion results on seven typical infrared and visible image pairs from the TNO dataset. a Men in front of house; b Bunker; c Soldier_behind_smoke_1; d Nato_camp_sequence; e Kaptein_1123; f Lake; g Barbed_wire_1. From top to bottom: infrared image, visible image, the results of GF, MSVD, DRTV, SGJSR, FPDE, and our method. A small area (the red rectangle) in each fusion result is enlarged and shown at the bottom

Fig. 9 Comparison of five evaluation parameters (E, AG, SD, QAB/F, and SF) for GF, MSVD, DRTV, SGJSR, FPDE, and our method

4.3.3 Results and discussion

As seen in Figs. 6, 7, 8, and 9, all 12 methods can effectively fuse infrared and visible images. In the other MGA-based methods, the fused images are dark and the target is not prominent, as can be clearly seen from the sky in “Men in front of house” and “Kaptein_1123” in Fig. 6. The proposed method yields clearly and easily identifiable target information. In terms of objective evaluation parameters, the proposed method generally scores higher than the other methods, as seen in Fig. 7. In short, the presented method is superior to the other MGA-based methods.

Compared with the five advanced methods, the presented method achieves the best visual quality, as shown in Fig. 8. However, the objective evaluation parameters (i.e., E, AG, SD, SF, and QAB/F) in Fig. 9 fluctuate: our method does not always attain the highest values, but it delivers more stable image quality. Overall, our method is competitive with the five advanced fusion methods.

5 Conclusions

In this study, we propose a new fusion algorithm for infrared and visible images based on nonlinear enhancement and NSST decomposition. Experiments demonstrate that the algorithm can not only retain the texture details of the visible image but also highlight the targets in the infrared image. Compared with other MGA-based and advanced algorithms, it is competitive or even superior in terms of qualitative and quantitative evaluation, and its fusion performance is beneficial for target detection and tracking in complex environments.