Regular article
Infrared small target segmentation with multiscale feature representation

https://doi.org/10.1016/j.infrared.2021.103755

Highlights

  • The local similarity pyramid module effectively captures multiscale features of infrared small targets.

  • The feature aggregation module merges shallow and deep features using attention.

  • The proposed network outperforms other state-of-the-art methods.

  • The ablation study demonstrates the contribution of each module of the proposed network.

Abstract

Small target segmentation is a vital technique in various infrared-based applications. It faces two typical challenges: infrared small targets are extremely small compared with common targets, and their dim appearance makes them similar to background noise. To address these problems, this paper studies how to leverage the pyramid structure and the attention mechanism for the segmentation of infrared small targets. Multiple local similarity pyramid modules (LSPMs) give the network a strong capability to model the multiscale features of infrared small targets. Specifically, each LSPM operates at a different scale and estimates a local similarity weight, which quantifies the degree to which a pixel is similar to other pixels. The pyramid features are introduced into the feature aggregation module as a supplement to the global features. The proposed network aggregates features with different weights, which facilitates the fusion of shallow and deep features. We empirically evaluate the proposed network on public infrared small target segmentation datasets. The experimental results demonstrate that the network achieves better performance than other state-of-the-art methods. The code is publicly available at https://github.com/HuangLian126/LSPM.

Introduction

Small target segmentation for infrared images, which aims to assign a class label to each pixel of a given infrared image, plays an essential role in applications such as maritime surveillance [1] and infrared guidance [2]. Unlike common small targets in RGB images, infrared small targets have their own distinct characteristics. First, due to the long distances between targets and infrared sensors, targets are extremely small in infrared images, sometimes occupying only a single pixel. In other words, background pixels usually dominate infrared images. Second, the targets appear very dim in infrared images because the infrared radiation energy attenuates markedly with distance. Consequently, the target is easily confused with the background and sensor noise. These characteristics significantly complicate infrared small target segmentation.

Traditional methods design hand-crafted features based on various assumptions. The early morphological filtering-based methods [3] need to optimize the parameters of the morphological structural elements for different scenes and cannot stably adapt to scenes in which target sizes vary over a large range. The local contrast measure-based methods [5], [27], [28], [29], [30], whose core idea is to enhance the target and suppress the noise, assume that the target is brighter than the background. These methods usually slide a predefined window over an infrared image and calculate the ratio or difference between the central pixel of the window and the surrounding pixels as an indicator of local contrast. Pixels with higher local contrast are more likely to belong to the target. However, these methods fail when the target is close to a bright background, because the local contrast of the target is then low. The robust principal component analysis (RPCA)-based methods [4], [31] assume that the background and the target can be approximately represented as a low-rank matrix and a sparse matrix, respectively. However, this type of method suffers from serious missed detections. In general, traditional methods are limited by multiple assumptions and prior knowledge and therefore generalize poorly.
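The sliding-window idea behind local contrast measures can be sketched in a few lines. The example below is a minimal illustration of the difference-based variant only, not the exact measure of any cited method; the function name `local_contrast`, the `radius` parameter, and the toy frame are assumptions made for this sketch.

```python
def local_contrast(image, radius=1):
    """Difference between each pixel and the mean of its surrounding window.

    `image` is a list of equal-length rows of numbers with more than one
    pixel in total. Border pixels use only the neighbours inside the image.
    This is an illustrative difference-based contrast, not a cited method.
    """
    height, width = len(image), len(image[0])
    contrast = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            neighbours = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
                if (rr, cc) != (r, c)
            ]
            contrast[r][c] = image[r][c] - sum(neighbours) / len(neighbours)
    return contrast


# Toy 5x5 frame: flat background of 10 with one bright pixel at the centre.
frame = [[10] * 5 for _ in range(5)]
frame[2][2] = 50
contrast_map = local_contrast(frame)

# Location of the highest local contrast, as (value, (row, column)).
peak = max(
    (value, (r, c))
    for r, row in enumerate(contrast_map)
    for c, value in enumerate(row)
)
```

In this toy frame the bright pixel receives the highest contrast, while its neighbours receive a negative value because the bright pixel raises their surrounding mean. If the whole neighbourhood were bright, the difference at the target would shrink, which is the failure case described above.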

Recently, deep convolutional neural networks (CNNs), especially fully convolutional networks (FCNs) [6], have achieved remarkable progress in segmentation tasks [7], [8]. Infrared small target segmentation is itself a segmentation task, which naturally raises a question: is it feasible to apply deep CNN-based methods designed for RGB images to segment infrared small targets? We revisit the distinction between conventional segmentation and infrared small target segmentation, which is mainly reflected in the multiscale appearances of the targets to be segmented. Intuitively, a method can segment infrared small targets if it is endowed with a strong capability to model multiscale features. One solution is to implement powerful pyramid structures, such as pyramid pooling modules (PPMs) [9] and atrous spatial pyramid pooling (ASPP) [7], [10], which adequately capture the rich features of different targets at multiple scales. To address the other difficulty, namely that infrared small targets are easily confused with background noise, and inspired by [11], [12], we can accurately measure the local similarity by explicitly modeling the differences between pixels. After obtaining multiscale features, a method must decide how to aggregate them, that is, how to decode. Existing methods usually rely on simple and direct merge operations [6], [8]. However, these merge operations may not provide a seamless connection between shallow features and multiscale deep features [13]. Shallow features are low-level features that contain abundant coarse information about edges, lines, and corners, while deep features are high-level features that contain abstract information about the target. Pang et al. [13] refer to this kind of difference as a semantic gap. A simple merging operation retains coarse information in the newly fused features, and this coarse information may act as background noise and affect the segmentation of infrared small targets.

Following the above idea, we propose a CNN-based method with multiscale feature representation for the segmentation of infrared small targets; the overall pipeline is illustrated in Fig. 1. Because infrared targets have tiny appearances, a network needs a powerful feature representation ability to model them. To strengthen the representation learning of multiscale targets, we build several pyramid modules in parallel at different scales. As mentioned above, this pyramid structure can capture the multiscale features of the target. Each scale corresponds to a local similarity pyramid module (LSPM). In addition, dim infrared targets are easily confused with complicated background regions. To address this issue, the LSPM calculates a local similarity weight for each pixel, which quantifies the degree to which the pixel is similar to other pixels in the feature maps. However, pixel-level similarity alone may lose region information: not all pixels are equally important, and the region that contains the target carries useful information. Inspired by [12], the LSPM transforms the pixel-level similarity into a region-level similarity that represents the degree of similarity of the pixels in specific regions of the feature maps. In this way, the LSPM further distinguishes an infrared small target from the background by learning the region information. Finally, we stack several LSPMs in parallel to obtain pyramid features that carry rich information about targets. Note that we add extra upsampled pyramid features in parallel, termed global guidance, as a supplementary input to the feature aggregation module (FAM). The FAM takes three inputs: shallow features, deep features, and global guidance. It is reasonable to retain important feature maps and ignore irrelevant ones during feature aggregation. We therefore base the fusion process on channel attention [14], which aggregates each input with a specific weight. In this way, the network is able to segment infrared small targets while perceiving multiscale features. The contributions are summarized as follows:

  • We propose a network for infrared small target segmentation with multiscale feature representation that achieves state-of-the-art performance on benchmark datasets.

  • The multiscale feature pyramid and the feature aggregation with channel attention boost the multiscale representation learning ability of the network.
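The channel attention [14] that the introduction names as the basis of feature aggregation can be sketched as a squeeze-and-excitation style gate. This excerpt does not specify the layers of the FAM, so the code below is only a generic illustration: the function name `channel_attention`, the toy inputs, and the fixed weights `w_reduce` and `w_expand` (which would be learned in a real network) are all assumptions.

```python
import math


def channel_attention(feature_maps, w_reduce, w_expand):
    """Squeeze-and-excitation style gating (illustrative, untrained weights).

    feature_maps: list of C channels, each an H x W list of rows.
    w_reduce:     hidden x C weight matrix of the bottleneck layer.
    w_expand:     C x hidden weight matrix of the output layer.
    Returns the rescaled channels and the per-channel gates.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]
    # Excitation: fully connected layer + ReLU, then fully connected + sigmoid.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(weights, squeezed)))
        for weights in w_reduce
    ]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(weights, hidden))))
        for weights in w_expand
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(feature_maps, gates)
    ]
    return scaled, gates


# Toy stand-ins for the three inputs, one 2x2 channel each (hypothetical values).
shallow = [[1.0, 1.0], [1.0, 1.0]]
deep = [[2.0, 2.0], [2.0, 2.0]]
guidance = [[3.0, 3.0], [3.0, 3.0]]

# Hypothetical fixed weights; a trained network would learn these.
w_reduce = [[1.0, 1.0, 1.0]]
w_expand = [[-1.0], [1.0], [0.0]]

scaled, gates = channel_attention([shallow, deep, guidance], w_reduce, w_expand)
```

With these hand-picked weights the gate suppresses the first channel, passes the second almost unchanged, and halves the third. How the gated inputs are finally merged is left out because the excerpt does not describe it.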

Section snippets

Related work

Our network is highly related to the following aspects.

Infrared small target segmentation: Conventional methods usually leverage morphological filtering [3], local contrast measures [5], [27], [28], [29], [30], and RPCA [4], [31], followed by adaptive thresholding to segment infrared small targets. Zeng et al. [3] design the Top-Hat morphological filter with the parameters of morphological structural elements via global search. Wei et al. [5] construct a local contrast measure based on a

Methods

In this section, we start with an illustration of the local similarity pyramid module and then describe the feature aggregation module with channel attention.
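As a rough illustration of what a pixel-level local similarity can look like, the sketch below scores each pixel by the mean cosine similarity between its feature vector and those of its neighbours. The formulation of the LSPM is not given in this excerpt, so this is a generic stand-in rather than the module itself; the names `cosine` and `local_similarity`, the `radius` parameter, and the toy features are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def local_similarity(features, radius=1):
    """Mean cosine similarity between each pixel and its window neighbours.

    `features` is an H x W grid (more than one pixel) whose entries are
    feature vectors. Illustrative only; not the LSPM formulation.
    """
    height, width = len(features), len(features[0])
    similarity = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            scores = [
                cosine(features[r][c], features[rr][cc])
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
                if (rr, cc) != (r, c)
            ]
            similarity[r][c] = sum(scores) / len(scores)
    return similarity


# Toy 3x3 feature grid: background vectors everywhere, one distinct target vector.
background = [1.0, 0.0]
target = [0.0, 1.0]
features = [[background] * 3 for _ in range(3)]
features[1][1] = target
similarity = local_similarity(features)
```

In this toy grid the target pixel has zero similarity to its surroundings, while background pixels remain similar to most of their neighbours, so a low score marks a pixel that stands out from its local region.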

Experiments

In this section, we first describe the benchmark datasets and metrics. We then provide the implementation details for our network. Next, the proposed network is qualitatively and quantitatively compared with state-of-the-art infrared small target segmentation (ISTS) methods. Finally, we conduct an ablation study to demonstrate the effectiveness of our modules.

Conclusion

In this paper, we propose a network that leverages the local similarity pyramid module and the feature aggregation module to segment infrared small targets. Extensive experiments show that the network can extract the multiscale features of infrared small targets and distinguish targets from noise using local similarity. The network outperforms existing state-of-the-art methods. Meanwhile, our network can extend to other generic small target object segmentation or semantic

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 61671094).

References (33)

  • H.S. Zhao, J.P. Shi, X.J. Qi, X.G. Wang, J.Y. Jia, Pyramid scene parsing network, IEEE Conf. Comput. Vis. Pattern...
  • L.C. Chen, Y.K. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic...
  • X.L. Wang, R. Girshick, A. Gupta, K.M. He, Non-local neural networks, in: IEEE Conf. Comput. Vis. Pattern Recog.,...
  • J.J. He et al.
  • Y.W. Pang, Y.Z. Li, J.B. Shen, L. Shao, Towards bridging semantic gap to improve semantic segmentation, in: IEEE Int....
  • J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: IEEE Conf. Comput. Vis. Pattern Recog., (CVPR), IEEE, Salt...