The process of detecting and tracking maritime moving targets in this paper is shown in Figure 3. After correcting the GF-4 satellite image sequence using geographic location information, the images are preprocessed to eliminate cloud noise. The ViBe background modeling method is then used to extract the foreground, that is, the binary image of the moving targets. Next, the center of mass of each target is calculated to obtain its pixel coordinates. After a geographic affine transformation, the real geographic location of each target in the world coordinate system is obtained. Finally, the DCF-CSR tracking algorithm tracks the targets in the sequence one by one and associates them across frames.
3.1. Image Preprocessing
3.1.1. Geolocation Correction of Remote Sensing Images
Because the GF-4 satellite orbits at 36,000 km, almost a hundred times the altitude of a low-Earth orbit, an attitude deviation of a single arc-second displaces the ground footprint by several kilometers. Therefore, during the imaging process of the remote sensing satellite, various errors occur even though the sensor is stationary relative to the Earth's surface. Internal errors include the performance of the sensor itself and deviations of its technical indices from standard values; external errors are caused by factors outside the sensor, such as changes in sensor position and orientation and non-uniformity of the sensing medium. These errors mean that the sensor's image is not tied to a geographic reference. To prevent these factors from interfering with target detection in the image sequence, the RPC (Rational Polynomial Coefficients) parameter file is used to geometrically correct the image sequence during the preprocessing stage. The main steps include:
The left side of Figure 4 below is the original image, defined in the image coordinate system; the right side is the corrected image, defined in the map coordinate system, together with the range of the corrected image and the corresponding ground position as represented in the computer.
The four corner points of the original image are transformed into the map coordinate system according to the transformation model, yielding four coordinate pairs $(X_1, Y_1)$, $(X_2, Y_2)$, $(X_3, Y_3)$, and $(X_4, Y_4)$. Then, the minimum and maximum values of the $X$ and $Y$ coordinate groups are found:
$$X_{min} = \min(X_1, X_2, X_3, X_4), \quad X_{max} = \max(X_1, X_2, X_3, X_4),$$
$$Y_{min} = \min(Y_1, Y_2, Y_3, Y_4), \quad Y_{max} = \max(Y_1, Y_2, Y_3, Y_4).$$
Among them, $X_{min}$, $X_{max}$, $Y_{min}$, and $Y_{max}$ are the map coordinate values of the four boundaries of the corrected image range. The total number of rows and columns of the corrected image can then be calculated from the ground sizes $\Delta X$ and $\Delta Y$ of the output pixel. Thus, the image range and the image width and height are obtained.
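As an illustrative sketch of this extent computation (hypothetical helper name; the transformation model is assumed to have already mapped the four corners into map coordinates):

```python
def output_extent(corners, dx, dy):
    """Given the four corner points already transformed into map
    coordinates, return (Xmin, Ymin, Xmax, Ymax) and the corrected
    image size in pixels for ground pixel sizes dx, dy."""
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    xmin, xmax = min(xs), max(xs)
    ymin, ymax = min(ys), max(ys)
    # Total columns/rows follow from the extent and the ground pixel size.
    cols = int(round((xmax - xmin) / dx))
    rows = int(round((ymax - ymin) / dy))
    return (xmin, ymin, xmax, ymax), cols, rows
```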
The geodetic latitude and longitude coordinates of each image point are calculated from the image point coordinates of the image to be corrected and the RPC parameters. Given the elevation surface $h$, the image point coordinates on the original image are computed from the geodetic coordinates of the object point at elevation $h$. The pixel value at that point is then interpolated by the bilinear interpolation method. The pixel values of all image points to be corrected are calculated in sequence in this sub-process; that is, the image is resampled.
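The bilinear resampling step can be sketched in plain Python as follows (illustrative only; a real implementation would also handle image borders and no-data values):

```python
def bilinear(img, x, y):
    """Bilinear interpolation of a 2-D grayscale image at fractional
    pixel coordinates (x, y); img is indexed as img[row][col]."""
    x0, y0 = int(x), int(y)
    x1, y1 = x0 + 1, y0 + 1
    fx, fy = x - x0, y - y0
    # Weighted average of the four surrounding pixels.
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bot * fy
```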
3.1.2. Massive Cloud Speckle Removal
It can be seen from the characteristics of the GF-4 satellite imagery in Section 2 that the cloud noise on the sea surface brings the already tiny targets even closer to the background, which makes feature-based detection of targets in video sequences very difficult and prone to a large number of false detections. Furthermore, over the whole sequence, the noise formed by reflections from sea waves, clouds, and fog also moves and flickers from frame to frame, which likewise causes misdetections in methods based on inter-frame differencing. Therefore, before the image sequence is detected, image preprocessing is needed to avoid these situations.
Analysis of the grayscale distribution of Figure 5b (shown below) reveals that the grayscale of the target, marked by the red circle, forms a peak, while the cloud noise is smoothly distributed in the background. Inspired by the smoothing effect of the mean filter, we pass the image through a mean filter to flatten the peaks, obtaining a smooth image that approximates the noisy background, as shown in Figure 4. Finally, this background is subtracted from the original image to eliminate the cloud noise, completing the preprocessing of the GF-4 imagery. This preprocessing can only filter out the massive cloud speckles; distinguishing the remaining static islands from the targets is left to the background modeling method.
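A minimal NumPy sketch of this mean-filter background subtraction (hypothetical function name; a real pipeline would use an optimized filter implementation):

```python
import numpy as np

def remove_cloud_speckle(img, k=5):
    """Estimate the smooth cloud background with a k x k mean filter and
    subtract it from the original image, keeping only the peaks
    (candidate targets). Edges are handled by edge padding."""
    pad = k // 2
    padded = np.pad(img.astype(float), pad, mode='edge')
    background = np.zeros_like(img, dtype=float)
    # Sliding-window mean (loop version for clarity, not speed).
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            background[i, j] = padded[i:i + k, j:j + k].mean()
    residual = img - background
    return np.clip(residual, 0, None)
```

On a flat background, the residual is near zero everywhere except at grayscale peaks, which is exactly the property the preprocessing relies on.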
3.2. Moving Target Detection
In 2011, Olivier Barnich and Marc Van Droogenbroeck [18] proposed a new background modeling method: the ViBe (Visual Background Extractor) algorithm. It is a general-purpose target detection algorithm with no specific requirements on video stream type, color space, or scene content. The algorithm introduces a random selection mechanism into background modeling for the first time, describing the random variability of the actual scene by randomly selecting samples to estimate the background model. By adjusting the time sub-sampling factor, very few sample values can cover all background samples, balancing accuracy and computational load. For the GF-4 image sequences, where feature extraction cannot be used to detect targets, the ViBe algorithm, which detects moving targets from the image through background modeling, works very well. Like general background modeling methods, the ViBe algorithm consists of three steps: background model initialization, foreground extraction, and background model update. Because the targets in GF-4 satellite images are very small, this paper also makes special adjustments to the parameters of the ViBe algorithm.
The ViBe algorithm describes the background by defining a sample set for each pixel $x$ in the image. The $K$ samples are expressed as:
$$M(x) = \{v_1, v_2, \dots, v_K\}$$
Some popular methods, such as [20,21], require a sequence of tens of frames to complete model initialization. ViBe initializes the background model from a single frame: the model is obtained by randomly sampling, under a uniform law, the pixels around each corresponding pixel, so it is also called a sampled background model; this is a significant advantage of the ViBe algorithm. Since GF-4 satellite image sequences contain few frames, ViBe is well suited to them. The model initialization is described as:
$$M^{0}(i, j) = \{ v^{0}(r, c) \mid (r, c) \in N_{8}(i, j) \}$$
where $(r, c)$ are the points of the eight-neighborhood $N_8(i, j)$, and $i$ and $j$ are the coordinates of the specified pixel in the image. The advantage of this method is that it greatly shortens the background-establishment time and also learns quickly when the background changes greatly.
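A sketch of this single-frame initialization (hypothetical `vibe_init` helper; the eight-neighborhood is clipped at image borders):

```python
import random

def vibe_init(img, K=20):
    """Single-frame ViBe initialization: each pixel's sample set is
    filled with K values drawn uniformly from its 8-neighborhood."""
    h, w = len(img), len(img[0])
    model = [[None] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Eight-neighborhood (r, c) of pixel (i, j), clipped at borders.
            nbrs = [(r, c) for r in range(max(i - 1, 0), min(i + 2, h))
                           for c in range(max(j - 1, 0), min(j + 2, w))
                           if (r, c) != (i, j)]
            model[i][j] = [img[r][c] for r, c in
                           (random.choice(nbrs) for _ in range(K))]
    return model
```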
The foreground extraction of the ViBe algorithm is performed by calculating the distance (pixel value difference) between the new pixel value $v(x)$ and each sample value $v_k$ in the sample set $M(x)$. When the result is less than a given threshold $d$, the new value is considered similar to that sample. When the number of matching samples is not less than the minimum cardinality $\#_{min}$ (generally set to 2), the pixel is considered background; otherwise, it is judged to be foreground. The schematic diagram is shown in
Figure 6.
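The foreground test described above can be sketched as follows (hypothetical helper; `d` and `n_min` mirror the threshold and minimum cardinality from the text):

```python
def classify_pixel(value, samples, d=20, n_min=2):
    """ViBe foreground test: the pixel is background if at least n_min
    samples in its model lie within distance d of the new value."""
    matches = sum(1 for s in samples if abs(value - s) < d)
    return 'background' if matches >= n_min else 'foreground'
```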
Unlike traditional background modeling methods, the ViBe algorithm innovatively adopts a random selection strategy: only pixels that conform to the background model participate in the model update. The update strategy is shown in Figure 7. One sample in the corresponding sample set $M(x)$ is randomly selected and replaced with the current value of pixel $x$. Let the learning rate be $LR$ (the reciprocal of the update probability; typical values are 2–64; the smaller the value, the faster the update). The update strategy can then be expressed as follows: (a) When a pixel $x$ is determined to be background, it has a probability of $1/LR$ of updating its corresponding sample set $M(x)$; when the condition is met, one of the sample values is randomly selected for replacement. (b) At the same time, there is a probability of $1/LR$ of updating the model of its neighbors; when this condition is met, one of the 8 neighbor points $(r, c)$ ($r$, $c$ being the row and column index) is randomly selected, and then a sample $k$ in the sample set $M(r, c)$ of that neighbor is randomly selected and replaced with the value of pixel $x$.
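A sketch of this conservative random update rule (hypothetical `vibe_update` helper; `LR` is the learning rate from the text):

```python
import random

def vibe_update(model, i, j, value, LR=16):
    """ViBe conservative update: with probability 1/LR replace a random
    sample of the background pixel's own model, and with probability
    1/LR propagate the value into a random 8-neighbor's model."""
    h, w = len(model), len(model[0])
    if random.randrange(LR) == 0:
        model[i][j][random.randrange(len(model[i][j]))] = value
    if random.randrange(LR) == 0:
        nbrs = [(r, c) for r in range(max(i - 1, 0), min(i + 2, h))
                       for c in range(max(j - 1, 0), min(j + 2, w))
                       if (r, c) != (i, j)]
        r, c = random.choice(nbrs)
        model[r][c][random.randrange(len(model[r][c]))] = value
```

The neighborhood propagation in step (b) is what lets ViBe absorb ghost regions into the background over time.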
It is easy to see that the ViBe model is defined by only two parameters: the sample radius $R$ and the minimum cardinality $\#_{min}$. In [18], the authors describe ViBe as a non-parametric model that needs no parameter adjustment during background extraction. However, in [22], for videos with different types of targets, adjusting the parameters makes the ViBe algorithm more suitable. Among these parameters, the subsampling factor controls the probability of updating the model. In [18], the parameters of the earliest version of the ViBe algorithm are set as follows: the number of samples per pixel is 20, the search radius is 20, the minimum cardinality is 2, and the subsampling rate is 16. However, in our target detection experiments on a large number of GF-4 image sequences, we found that, because the targets are very weak, few target points are detected. The number of samples per pixel and the minimum cardinality should therefore be appropriately increased to raise the number of detected target points and expand the target contour, which facilitates subsequent target positioning and centroid calculation.
3.3. Geographic Affine Transformation
After the targets in the GF-4 satellite image sequence are detected by the ViBe algorithm, the foreground of each image is extracted. The result is a binary image sequence segmented into foreground and background. Thus, to automatically obtain the actual geographic position of each target, we need to locate the target's pixel coordinates in the image and then obtain its coordinates in the world coordinate system through a geographic affine transformation.
First, the Canny operator is used to detect edges in the image sequence to obtain the target contours, and then the central moments of these contours are calculated. For an $M \times N$ image with gray value $f(i, j)$ at point $(i, j)$, the $(p+q)$-th geometric moment $m_{pq}$ is:
$$m_{pq} = \sum_{i=1}^{M} \sum_{j=1}^{N} i^{p} j^{q} f(i, j)$$
Then, the center of gravity of the target is $(\bar{x}, \bar{y})$, where $\bar{x} = m_{10}/m_{00}$ and $\bar{y} = m_{01}/m_{00}$.
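The centroid computation from the zeroth and first geometric moments can be sketched as:

```python
def centroid(img):
    """Center of mass from geometric moments:
    x_bar = m10/m00, y_bar = m01/m00, with f(i, j) the gray value."""
    m00 = m10 = m01 = 0.0
    for i, row in enumerate(img):        # i: row index
        for j, f in enumerate(row):      # j: column index
            m00 += f
            m10 += j * f
            m01 += i * f
    return m10 / m00, m01 / m00
```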
After obtaining the pixel coordinates of the target, its geographic location is determined by a geographic affine transformation. The affine transformation expresses the relationship between pixel coordinates and geographic coordinates on a GeoTIFF remote sensing image with the following formulas:
$$X_{geo} = GT_0 + X_{pixel} \cdot GT_1 + Y_{line} \cdot GT_2$$
$$Y_{geo} = GT_3 + X_{pixel} \cdot GT_4 + Y_{line} \cdot GT_5$$
Here, $(X_{geo}, Y_{geo})$ represents the actual geographic coordinates corresponding to the pixel coordinates $(X_{pixel}, Y_{line})$ on the image. For a north-up image, $GT_2$ and $GT_4$ are equal to 0, $GT_1$ is the pixel width, and $GT_5$ is the pixel height. The coordinate pair $(GT_0, GT_3)$ represents the coordinates of the upper-left corner of the upper-left pixel. Through this affine transformation, we can obtain the geographic coordinates corresponding to every pixel on the image.
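A sketch of the pixel-to-geographic conversion using a GeoTIFF-style geotransform tuple (the six coefficients in the order exposed by, e.g., GDAL):

```python
def pixel_to_geo(gt, col, row):
    """Apply a GeoTIFF-style affine geotransform gt = (GT0..GT5):
    Xgeo = GT0 + col*GT1 + row*GT2, Ygeo = GT3 + col*GT4 + row*GT5."""
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y
```

For a north-up image, only the pixel sizes `gt[1]` and `gt[5]` (negative for north-up rasters) and the upper-left corner `(gt[0], gt[3])` contribute.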
3.4. Target Tracking and Association
In the previous section, the detection of moving targets in GF-4 satellite imagery was introduced. However, this only marks target detections in individual frames of the image sequence; the motion information of a target across the entire sequence is not yet associated. Here, we introduce the CSR-DCF tracker (CSRT) to track and associate the targets in the detected image sequence.
The MOSSE (Minimum Output Sum of Squared Error filter) algorithm is a target tracking algorithm proposed by Bolme, D.S. et al. [23] in 2010. The algorithm uses correlation filter (CF) technology for the first time. The basic idea of CF tracking is to design a filter template and correlate it with the target candidate area; the position of the maximum output response is the target position in the current frame. The formula is as follows:
$$y = x \star w$$
where $y$ represents the response output, $x$ the input image, and $w$ the filter template, with $\star$ denoting correlation. The correlation theorem converts the correlation into a computationally cheaper element-wise product:
$$\hat{y} = \hat{x} \odot \hat{w}^{*}$$
Here, $\hat{y}$, $\hat{x}$, and $\hat{w}$ are the Fourier transforms of $y$, $x$, and $w$, respectively, and $^{*}$ denotes complex conjugation. The task of correlation filtering is to find the optimal filter template.
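The Fourier-domain correlation described above can be sketched with NumPy (illustrative; MOSSE-style trackers additionally log-transform, normalize, and window the patches):

```python
import numpy as np

def correlate_fft(x, w):
    """Correlation response y = x (star) w computed in the Fourier domain:
    y = IFFT( FFT(x) * conj(FFT(w)) )."""
    Y = np.fft.fft2(x) * np.conj(np.fft.fft2(w))
    return np.real(np.fft.ifft2(Y))

def peak_location(response):
    """Row/column of the maximum response, i.e. the estimated target."""
    return np.unravel_index(np.argmax(response), response.shape)
```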
To improve the robustness of the filter template, MOSSE uses multiple samples of the target as training samples to generate a better filter. MOSSE takes the minimum squared error as the objective function and uses $m$ samples to find the least-squares solution:
$$\min_{\hat{w}} \sum_{i=1}^{m} \left| \hat{x}_i \odot \hat{w}^{*} - \hat{y}_i \right|^{2}$$
The training pairs are obtained by applying random affine transformations to the tracking box (ground truth), producing a series of training samples $x_i$; each $y_i$ is generated by a Gaussian function whose peak lies at the center of $x_i$. The value of the Gaussian map at the position to which the original image center is mapped by the affine transformation is the $y_i$ of that sample.
The discriminative correlation filter (DCF) method has been continually updated and improved by researchers. Lukežič, A. et al. [19] introduced the concepts of channel and spatial reliability into DCF tracking in 2017, providing the learning algorithm CSR-DCF (discriminative correlation filters with channel and spatial reliability), which integrates them efficiently and seamlessly into filter updating and tracking. This method has higher accuracy than MOSSE and KCF (Kernelized Correlation Filters) [24]. It achieved state-of-the-art results on VOT2016 (Visual Object Tracking Benchmark 2016), VOT2015 (Visual Object Tracking Benchmark 2015), and OTB100 (Object Tracking Benchmark).
For the small targets in GF-4 satellite image sequences, this paper uses the CSR-DCF method combined with the ViBe algorithm to locate and track targets, and achieves good results. The method introduces two concepts, spatial reliability and channel reliability [19]. Spatial reliability uses an image segmentation method to compute a binary constraint mask in the spatial domain from the background color-histogram probability and a prior on the object center. The binary mask here is similar to the mask matrix P in CFLB (correlation filters with limited boundaries) [25]; CSR-DCF's segmentation-based mask selects the effective tracking region more accurately. Channel reliability weights each channel during detection; the weight is determined by channel learning reliability and channel detection reliability.
A single tracking iteration of this method is divided into two parts, localization and update, as follows. First, the localization step: features are extracted from a search area centered on the target position estimated in the previous time step and correlated with the learned filter $h_{t-1}$. The object is located by summing the per-channel correlation responses, weighted by the estimated channel reliability scores $w$. As described by Danelljan et al. [26], the scale is estimated by a single scale-space correlation filter. The per-channel filter responses are also used to calculate the corresponding detection reliability values $\tilde{w}^{(det)}_d$ according to the formula:
$$\tilde{w}^{(det)}_d = 1 - \min\!\left( \frac{\rho^{(max2)}_d}{\rho^{(max1)}_d},\ \frac{1}{2} \right)$$
where $\tilde{w}^{(det)}_d$ is the detection reliability of the $d$-th channel and $\rho^{(max2)}_d / \rho^{(max1)}_d$ is the ratio between the second and first highest non-adjacent peaks in the channel response map.
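As a sketch of this ratio rule (hypothetical helper; the clamping at 1/2 follows the CSR-DCF formulation described above):

```python
def detection_reliability(rho_max1, rho_max2):
    """Per-channel detection reliability from the two highest
    non-adjacent peaks of the channel response: a dominant main
    peak (small ratio) yields high reliability."""
    ratio = rho_max2 / rho_max1
    return 1.0 - min(ratio, 0.5)
```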
Then comes the update step. The training area is centered on the target location estimated in the localization step. The foreground and background histograms $c = (c^{f}, c^{b})$ are extracted and updated by an exponential moving average with learning rate $\eta_c$ (Step 2 in the Update step of Algorithm 1). The foreground histogram is extracted with an Epanechnikov kernel within the estimated object bounding box, while the background is extracted from a neighborhood twice the size of the object. The spatial reliability map $m$ is constructed, and the optimal filter $\tilde{h}$ is calculated by optimizing the augmented Lagrangian [27]. Each channel's learning reliability weight $\tilde{w}^{(lrn)}_d$ is estimated from the correlation response:
$$\tilde{w}^{(lrn)}_d = \max\!\left( f_d \star \tilde{h}_d \right)$$
since a discriminative feature channel $f_d$ produces a filter $\tilde{h}_d$ whose output $f_d \star \tilde{h}_d$ nearly exactly fits the ideal response.
Next, the channel reliability weights $w_d$ of the current frame are calculated from the detection and learning reliabilities:
$$w_d = \tilde{w}^{(lrn)}_d \cdot \tilde{w}^{(det)}_d$$
The filter and channel reliability weights are updated by an exponential moving average with learning rate $\eta$, i.e. $h_t = (1-\eta)\,h_{t-1} + \eta\,\tilde{h}$ and $w_t = (1-\eta)\,w_{t-1} + \eta\,\tilde{w}$ (Steps 7 and 8 in the Update step of Algorithm 1).
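The exponential moving average used for these model updates can be sketched as (hypothetical helper operating on flat coefficient lists):

```python
def ema_update(prev, new, eta):
    """Exponential moving average used for the filter and channel
    reliability updates: out = (1 - eta) * prev + eta * new."""
    return [(1 - eta) * p + eta * n for p, n in zip(prev, new)]
```

A small `eta` keeps the model stable; a larger `eta` adapts faster to appearance changes.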
Algorithm 1 The CSR-DCF tracking algorithm.
Require: Image, object position $p_{t-1}$ on the previous frame, scale $s_{t-1}$, filter $h_{t-1}$, color histograms $c_{t-1}$, channel reliability $w_{t-1}$.
Ensure: Position $p_t$, scale $s_t$, and updated models.
Localization and estimation:
1: New target location $p_t$: position of the maximum of the correlation between $h_{t-1}$ and the image patch features f extracted at position $p_{t-1}$ and weighted by the channel reliability scores w.
2: Using the per-channel responses, estimate the detection reliability $\tilde{w}^{(det)}$.
3: Using location $p_t$, estimate the new scale $s_t$.
Update:
1: Extract the foreground and background histograms $\tilde{c}^{f}$, $\tilde{c}^{b}$.
2: Update the foreground and background histograms $c_t$.
3: Estimate the reliability map m.
4: Estimate the new filter $\tilde{h}$ using m.
5: Estimate the learning channel reliability $\tilde{w}^{(lrn)}$ from $\tilde{h}$.
6: Calculate the channel reliability $\tilde{w}$.
7: Update the filter $h_t$.
8: Update the channel reliability $w_t$.