Abstract

The application of ship detection to assisted intelligent ship navigation places stringent requirements on a model’s detection speed and accuracy. To address this problem, this study uses an improved YOLO-V4 detection model (ShipYOLO) to detect ships. Compared with YOLO-V4, the model has three main improvements. Firstly, the backbone network (CSPDarknet) of YOLO-V4 is optimized. During training, parallel 3 × 3 convolution, 1 × 1 convolution, and identity branches replace the original feature extraction component (ResUnit) so that more features are extracted. During inference, the branch parameters are merged to form a new backbone network named RCSPDarknet, which improves the inference speed of the model while improving its accuracy. Secondly, to address missed detections of small-scale ships, we design a new receptive field amplification module named DSPP, built from dilated convolution and Max-Pooling, which improves the model’s acquisition of spatial information about small-scale ships and its robustness to spatial displacement of ship targets. Finally, we use the attention mechanism and ResNet’s shortcut idea to improve the feature pyramid structure (PAFPN) of YOLO-V4 and obtain a new feature pyramid structure named AtFPN. This structure improves feature extraction for ships of different scales and reduces the number of model parameters, further improving inference speed and detection accuracy. In addition, we have created a single-category ship dataset with a total of 2238 images. The experimental results show that ShipYOLO is faster and more accurate across different input sizes. With an input size of 320 × 320 on a PC equipped with an NVIDIA 1080Ti GPU, the FPS and mAP@0.5:0.95 (mAP90) of ShipYOLO increase by 23.7% and 13.6% (10.6%), respectively, compared with YOLO-V4.

1. Introduction

With the rapid development of deep learning in recent years, more and more deep learning techniques have been applied to intelligent ships [1, 2]. In 2020, Pan et al. proposed RMA, a fine-grained classification model based on deep learning [3], which identifies navigation marks and provides accurate navigation mark information for intelligent ships. In 2021, Du et al. developed an intelligent navigation mark recognition system using deep learning technology [4], which provided an effective solution for intelligent ships. A vision system that uses computer vision technology to identify ships, navigation marks, and obstacles in the navigation environment has become an essential part of the intelligent ship perception system [5]. Therefore, an effective ship detection model is of great significance for improving the safety of intelligent ships.

Researchers have proposed many traditional object detection models, which mainly rely on region selection [6], feature extraction [7], and classification [8]. In 2006, Dalal and Triggs proposed the HOG algorithm [9], which builds features by computing histograms of gradient directions over local image regions. Subsequently, Felzenszwalb et al. proposed the DPM algorithm [10], which produces excitation templates for image features and determines the target’s location from the distribution of the excitation. However, object detection predicts many redundant bounding boxes. In response to this problem, Neubeck and Van Gool proposed the NMS algorithm [11] to eliminate redundant boxes; this idea is also widely used in deep learning object detection models. Traditional object detection models are limited in many respects and cannot represent image features well.

The rise of deep learning in 2012 has had a massive impact on many fields, and object detection is no exception. The large number of parameters in a deep neural network allows it to extract features with better robustness and semantic relevance, and the classifier also performs better. Object detection models based on deep learning can therefore learn image characteristics better. Such models mainly take two forms, two-stage and one-stage; the main difference is whether the bounding box position and the box category are predicted in a single step. In 2014, Girshick et al. combined region candidates with CNNs to propose the two-stage detection model R-CNN [12], which opened the deep learning chapter of object detection. Based on R-CNN, Girshick proposed Fast R-CNN [13] to realize end-to-end detection and shared convolution. In 2015, Ren et al. proposed the Faster R-CNN [14] object detection model, which introduces the anchor box idea and the region proposal network; it significantly improves the detection accuracy of the R-CNN series and won many first places in the ILSVRC and COCO competitions.

In 2018, Redmon and Farhadi proposed YOLO-V3 [15], which added many excellent ideas to the network, such as residual connections [16], multilayer feature maps [17], and the removal of pooling layers, improving detection accuracy while preserving the detection speed of the YOLO series. With the continuous improvement of deep learning technology, more and more methods have been proposed to enhance object detection accuracy from different angles. In 2020, to improve the detection of analog instruments, Huang et al. proposed an improved YOLO-V3 algorithm for robot-based inspection [18], which locates the instrument effectively and detects it well. Also in 2020, Bochkovskiy et al. built on YOLO-V3 by integrating the best optimization strategies from the CNN field in recent years, covering data processing, backbone network, network training, activation function, and loss function, and proposed a better one-stage object detection model, YOLO-V4 [19]. Compared with YOLO-V3, the YOLO-V4 model uses richer data augmentation, including Mosaic data augmentation and self-adversarial training (SAT). On top of the YOLO-V3 backbone network, the Mish activation function and the idea of CSPNet are introduced to improve the backbone’s feature extraction. The SPP module is added behind the backbone network to enlarge the receptive field of the model and further improve detection.

Similarly, researchers have proposed many ship detection models based on deep learning. Like general object detection models, ship detection models come in two-stage and one-stage forms. Li et al. proposed a SAR image ship detection model based on an improved Faster R-CNN [20]. Although this two-stage detector improves on the original detection accuracy, the proposal filtering and ROI pooling operations limit the speed of the model, and real-time detection is difficult to achieve. Wang et al. studied the application of the SSD object detection model to ship detection against complex backgrounds [21] and used transfer learning to improve detection accuracy and overall performance. However, the single feature extraction network and FPN structure did not fully consider the detection of small-scale ships. Chen et al. used the attention mechanism to propose an improved YOLO-V3 (ImYOLO-V3) [22]; embedding the attention module into YOLO-V3 effectively improved detection accuracy, but the speed of the model was not optimized further. Jie et al. introduced the K-means clustering algorithm and the soft non-maximum suppression algorithm to adapt YOLO-V3 to ship scenes [23], but these improvements are engineering tuning techniques and do not address the accuracy problem of ship detection from the perspective of model construction. Shan et al. combined camera and inertial sensor data and proposed a new marine target detection algorithm based on camera motion posture [24]. This algorithm uses region candidates and edge detection to optimize the detection algorithm and improve the accuracy of ship detection. However, it still relies on traditional image enhancement, and its detection rate does not meet the requirements of real intelligent ship scenarios. In 2020, Li et al. proposed LSDM, an improved ship detection algorithm based on YOLO-V3 and DenseNet [25], which reduced the model parameters to 1/3 of those of the original YOLO-V3. However, its backbone network uses a large number of densely connected structures, and this design still limits the inference speed of the model.

In summary, current ship detection models still suffer from poor detection speed and missed detections of small-scale ships. First, to improve detection speed and make ship detection real-time, this paper optimizes the backbone network of YOLO-V4: while maintaining accuracy, the number of model parameters is reduced and the inference speed is effectively improved. Secondly, to address missed detections of small-scale ships, this paper designs a new receptive field amplification module and uses the attention mechanism to optimize the original feature pyramid of YOLO-V4, which effectively improves the detection of small-scale ships. The result is ShipYOLO, a faster and more accurate model for ship detection.

2. Methods

The YOLO-V4 model consists of a backbone network (CSPDarknet53), a receptive field amplification module (SPP), a feature pyramid (PAFPN), and a detection head (YOLOhead) (see Figure 1). The backbone network (CSPDarknet53) uses the CSP module composed of ResUnit components as the feature extraction part of the overall structure. The receptive field amplification module (SPP) uses pooling layers of different sizes to fuse features of different scales to amplify the receptive field. The Feature Pyramid Module (PAFPN) refers to PANet and obtains a two-layer pyramid structure. Although YOLO-V4 has good detection results overall, it has not been effectively designed for ship detection, so this paper has made targeted improvements to YOLO-V4.

2.1. Backbone Network Based on Structured Reparameterization (RCSPDarknet)

The original ResUnit component of YOLO-V4 [16] is a typical multibranch structure (see Figure 2(a)), where CBM_N is composed of an N × N convolution, batch normalization, and an activation function (Mish) in series. The calculation formula of the ResUnit component is shown as

Although the multibranch topology extracts features well, each branch’s results must be retained until they are added or concatenated, which significantly increases memory usage and seriously slows the model’s inference. Such a structure is poorly suited to ship detection, which demands high inference speed. Removing the branch structure can therefore effectively improve the inference speed of the model. However, a purely single-path model such as the classic VGG [26], composed of stacked 3 × 3 convolutions, has obvious advantages in speed but is far less accurate than the ResNet structure. Therefore, this study follows the idea of RepVGG [27] and uses the structure reparameterization technology to construct the feature extraction component RepUnit (see Figures 2(b) and 2(c)). Although the multibranch structure is slow at inference, it is more conducive to model training and feature extraction. Therefore, to improve both speed and accuracy, this paper first uses a multibranch structure for training; the calculation formula is as follows:

Then, the structure reparameterization technology is used to fuse the model parameters and convert each training block into a single 3 × 3 convolution layer for inference. The final calculation formula in the inference stage is shown as

This effectively improves the inference speed of the model while preserving its accuracy.

The structure reparameterization process and the calculation process of the convolution kernel are shown in Figure 3. First, the convolution layer and the batch normalization layer in the residual block are fused (many deep learning frameworks perform this operation in the inference stage), and the calculation formula is

where $W_i$ is the convolutional layer parameters before fusion, $\beta_i$ is the convolutional layer bias before fusion, $\mu_i$ is the mean value of the batch normalization layer, and $\sigma_i$ is the variance of the batch normalization layer.
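The fusion step can be checked numerically. The sketch below uses the standard way of folding batch normalization into the preceding convolution for a single output channel; the scale `gamma`, the shift `beta_bn`, the `eps` term, and all function names are assumptions made for illustration and are not taken from the paper’s own equation.

```python
import math

def fuse_conv_bn(weights, bias, gamma, beta_bn, mean, var, eps=1e-5):
    """Fold one batch-normalization channel into the conv weights and bias."""
    scale = gamma / math.sqrt(var + eps)
    fused_w = [w * scale for w in weights]
    fused_b = (bias - mean) * scale + beta_bn
    return fused_w, fused_b

def conv_then_bn(patch, weights, bias, gamma, beta_bn, mean, var, eps=1e-5):
    """Reference path: convolution on one window followed by batch normalization."""
    z = sum(w * p for w, p in zip(weights, patch)) + bias
    return gamma * (z - mean) / math.sqrt(var + eps) + beta_bn

def fused_conv(patch, fused_w, fused_b):
    """Inference path: a single convolution with the fused parameters."""
    return sum(w * p for w, p in zip(fused_w, patch)) + fused_b

# Flattened 3x3 input window and kernel (illustrative values only).
patch = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0, 0.25, -2.0]
weights = [0.1, -0.2, 0.3, 0.0, 0.5, -0.1, 0.2, 0.4, -0.3]

reference = conv_then_bn(patch, weights, 0.05, 1.2, -0.1, 0.3, 0.8)
fw, fb = fuse_conv_bn(weights, 0.05, 1.2, -0.1, 0.3, 0.8)
fused = fused_conv(patch, fw, fb)
print(abs(reference - fused))  # difference is at floating-point rounding level
```

Because batch normalization is an affine map at inference time, the two paths agree up to rounding, which is why the fusion costs no accuracy.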

Branch (a) directly fuses the 3 × 3 convolution layer with its batch normalization layer. Branch (b) fuses the 1 × 1 convolution layer with its batch normalization layer. Branch (c) is first expressed as a 3 × 3 convolution layer whose center weight is 1 and whose other weights are 0, and this layer is then fused with the batch normalization layer (because the branch does not change the input feature map, a kernel of this form reproduces the original values when it is applied to the input). Next, the fused convolution layer of branch (b) is converted into a 3 × 3 convolution layer: the value of the 1 × 1 kernel becomes the center point of the 3 × 3 kernel, and the other positions are filled with 0. Finally, the 3 × 3 convolution layers of the branches are merged by adding the weights and biases of all the branches, giving a single fused 3 × 3 layer.
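The branch merge can be sketched for one input and one output channel as follows. The sketch assumes each branch has already been fused with its batch normalization layer, so that it is described by a kernel and a bias, and it assumes the standard RepVGG construction in which the 1 × 1 and identity kernels are embedded at the center of a 3 × 3 kernel whose remaining eight positions are zero. All names and numeric values are hypothetical.

```python
def conv3x3(x, k, b):
    """3x3 convolution, stride 1, zero padding 1, single channel."""
    h, w = len(x), len(x[0])
    out = [[b] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        out[i][j] += k[di + 1][dj + 1] * x[ii][jj]
    return out

def center_to_3x3(value):
    """Embed a 1x1 (or identity) weight at the center of a zero 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, value, 0.0], [0.0, 0.0, 0.0]]

def merge_branches(k3, b3, w1, b1, id_scale, id_bias):
    """Sum the three branch kernels and biases into one 3x3 kernel and bias."""
    k1, kid = center_to_3x3(w1), center_to_3x3(id_scale)
    kernel = [[k3[r][c] + k1[r][c] + kid[r][c] for c in range(3)] for r in range(3)]
    return kernel, b3 + b1 + id_bias

x = [[1.0, 2.0, 0.5, -1.0],
     [0.0, 1.5, -0.5, 2.0],
     [3.0, -2.0, 1.0, 0.0],
     [0.5, 0.5, -1.5, 1.0]]
k3 = [[0.1, -0.2, 0.0], [0.3, 0.5, -0.1], [0.0, 0.2, -0.4]]
b3, w1, b1, id_scale, id_bias = 0.05, 0.7, -0.02, 1.1, 0.01

# Training-time form: three parallel branches added together.
y3 = conv3x3(x, k3, b3)
multi = [[y3[i][j] + (w1 * x[i][j] + b1) + (id_scale * x[i][j] + id_bias)
          for j in range(4)] for i in range(4)]

# Inference-time form: one merged 3x3 convolution.
kernel, bias = merge_branches(k3, b3, w1, b1, id_scale, id_bias)
single = conv3x3(x, kernel, bias)

max_diff = max(abs(multi[i][j] - single[i][j]) for i in range(4) for j in range(4))
print(max_diff)  # difference is at floating-point rounding level
```

Only the center weight of the merged kernel receives contributions from all three branches; the eight surrounding weights come from the 3 × 3 branch alone.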

In the end, we used the improved feature extraction component (RepUnit) to form a new module (RCSP) and got a new backbone network (RCSPDarknet), which effectively improved the model’s inference speed and had a better detection effect.

2.2. Spatial Pyramid Pooling Module Based on Dilated Convolution (DSPP)

YOLO-V4 was inspired by SPPNet [28] and added the SPP module (see Figure 4(b)). CBL_N is composed of an N × N convolution, batch normalization, and an activation function (Leaky) in series (it differs from CBM_N only in the activation function: CBM_N uses Mish and CBL_N uses Leaky), and MaxPool_N is a Max-Pooling layer whose kernel size is N. Pooling with fixed block sizes is used to stitch together different feature maps and fuse features of different sizes, which effectively improves detection on images with large differences in target size and increases the receptive field. However, ship sizes vary widely in ship detection, and missed detections of small-scale ships are a serious problem. With the original SPP structure and ordinary convolutions, it is difficult to enlarge the receptive field while still capturing small-scale targets in space. Luo et al. studied receptive fields in deep convolution networks [29] and pointed out that pixels at the center of the receptive field have a greater influence. In the forward pass, the center pixel has more paths along which to transmit its information to a neural node, whereas edge pixels have fewer. Similarly, in the backward pass, the center pixel of the receptive field receives more gradient from the corresponding neural nodes. Compared with ordinary convolution, dilated convolution [30] can reduce the loss of spatial features without reducing the receptive field and can account for the feature extraction of targets at different scales. Therefore, this paper draws on dilated convolution and SPPNet to design a new feature enhancement module (DSPP). While increasing the model’s receptive field, it improves feature extraction for small-scale targets in space, addresses missed detections of small-scale ships, and improves ship detection accuracy. The DSPP structure is shown in Figure 4(a). DBL_N is composed of a dilated convolution with a dilation rate (spatial interval) of N, a batch normalization layer, and an activation function (Leaky) in series.

Firstly, the feature map is passed through a 1 × 1 convolution layer to reduce the number of channels and is then divided into three branches. Each branch is composed of a Max-Pooling layer, a convolution layer, and a dilated convolution layer in series (the number of convolution kernels in each branch, the dilation rates, and the Max-Pooling kernel sizes are shown in Figure 4); the last branch uses two 3 × 3 convolutions instead of a 5 × 5 convolution, which reduces the parameters and deepens the nonlinear layers. Secondly, the feature maps of the three branches are concatenated and fed to a 1 × 1 convolution layer that converts the feature scale. Finally, following the residual structure of ResNet, we obtain the feature enhancement module DSPP.
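Two of the design choices above can be illustrated with simple arithmetic: a dilated 3 × 3 kernel covers a larger window without extra weights, and two stacked 3 × 3 convolutions cover the same 5 × 5 window as one 5 × 5 convolution with fewer weights. The dilation rates and the channel count used below are hypothetical, since the actual values are given only in Figure 4; biases are ignored.

```python
def effective_kernel_size(kernel_size, dilation):
    """Window covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied in series."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def conv_weights(kernel_size, channels_in, channels_out):
    """Number of kernel weights of one convolution layer (bias ignored)."""
    return kernel_size * kernel_size * channels_in * channels_out

# A 3x3 kernel at hypothetical dilation rates 1, 2, and 3.
windows = [effective_kernel_size(3, d) for d in (1, 2, 3)]
print(windows)  # [3, 5, 7]

# Two 3x3 convolutions see the same 5x5 window as one 5x5 convolution.
print(stacked_receptive_field([3, 3]))  # 5

channels = 256  # hypothetical channel count
two_3x3 = 2 * conv_weights(3, channels, channels)
one_5x5 = conv_weights(5, channels, channels)
print(two_3x3, one_5x5)  # 1179648 1638400
```

With equal input and output channels, the two-layer version needs 18/25 of the weights of the single 5 × 5 layer and adds one more nonlinearity.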

2.3. Feature Pyramid Based on Attention Mechanism (AtFPN)

In deep learning, fusing features of different scales is an essential means of improving performance, since convolution layers at different depths learn semantic features at different levels. The FPN [31] structure proposed by Anthimopoulos uses both the high resolution of low-level features and the rich semantic information of high-level features and makes predictions by fusing features from these different layers. YOLO-V4 adopts this idea and combines it with PANet [32], adding a bottom-up feature pyramid after the FPN layer to obtain PAFPN (see Figure 1). This structure propagates strong semantic features from top to bottom and strong localization features from bottom to top and aggregates parameters from different backbone layers for different detection layers. However, the PAFPN structure of YOLO-V4 still uses the same convolution module as the backbone network. Although this extracts features well, it brings an excessive number of parameters. Therefore, this paper takes the CBAM structure, merges it with PAFPN, and adds a residual connection to each semantic layer. This yields a new feature pyramid structure (AtFPN), which effectively improves the model’s accuracy and reduces the number of model parameters.

CBAM [33] was proposed by Woo et al. (see Figure 5(a)). This structure produces attention maps along the channel dimension and the spatial dimension in turn and applies them to intermediate feature maps, which helps information flow within the network. The channel attention module focuses on which features are meaningful. It first compresses the spatial dimension of the input feature map with an Avg-Pooling layer and a Max-Pooling layer, obtaining the global context information in the feature map while reducing interfering information; it then forwards the two pooled descriptors to a shared network (single-layer perceptron) and finally generates the channel attention map through a sigmoid (see Figure 5(b)). The calculation formula is shown as

Spatial attention is complementary to channel attention and assigns weights across the feature map to highlight spatial regions of interest (see Figure 5(c)). Firstly, Avg-Pooling and Max-Pooling operations are applied along the channel axis, and their outputs are concatenated to generate an effective feature descriptor; the spatial attention map is then obtained through a sigmoid. The calculation formula is as follows:
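The two attention steps can be sketched in plain Python as below. The sketch follows the CBAM formulation of Woo et al. rather than any detail specific to AtFPN: the shared network is assumed to have one ReLU hidden layer, the concatenated pooled maps are assumed to pass through a small convolution before the sigmoid, and the weights `w0`, `w1`, the 3 × 3 `kernel`, and the tiny feature map are hypothetical values chosen only to make the example run.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def shared_mlp(vec, w0, w1):
    """Shared network: linear layer, ReLU, linear layer (no biases)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w0]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w1]

def channel_attention(fmap, w0, w1):
    """One weight per channel from spatial average- and max-pooling."""
    flat = [[v for row in ch for v in row] for ch in fmap]
    avg = shared_mlp([sum(c) / len(c) for c in flat], w0, w1)
    mx = shared_mlp([max(c) for c in flat], w0, w1)
    return [sigmoid(a + m) for a, m in zip(avg, mx)]

def spatial_attention(fmap, kernel):
    """One weight per position; kernel[0] filters the avg map, kernel[1] the max map."""
    n, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    pooled = [
        [[sum(fmap[c][i][j] for c in range(n)) / n for j in range(w)] for i in range(h)],
        [[max(fmap[c][i][j] for c in range(n)) for j in range(w)] for i in range(h)],
    ]
    r = len(kernel[0]) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for c in range(2):
                for di in range(-r, r + 1):
                    for dj in range(-r, r + 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernel[c][di + r][dj + r] * pooled[c][ii][jj]
            out[i][j] = sigmoid(acc)
    return out

def cbam(fmap, w0, w1, kernel):
    """Apply channel attention first, then spatial attention."""
    mc = channel_attention(fmap, w0, w1)
    refined = [[[mc[c] * v for v in row] for row in ch] for c, ch in enumerate(fmap)]
    ms = spatial_attention(refined, kernel)
    return [[[ms[i][j] * v for j, v in enumerate(row)] for i, row in enumerate(ch)]
            for ch in refined]

# Two channels of size 3x3 with hypothetical weights.
fmap = [[[1.0, 2.0, 0.0], [0.5, 1.5, -1.0], [0.0, 1.0, 2.0]],
        [[0.0, -1.0, 1.0], [2.0, 0.5, 0.0], [1.0, 1.0, -0.5]]]
w0 = [[0.5, -0.25]]      # 2 channels -> 1 hidden unit
w1 = [[1.0], [-1.0]]     # 1 hidden unit -> 2 channels
kernel = [[[0.1] * 3 for _ in range(3)], [[-0.05] * 3 for _ in range(3)]]
out = cbam(fmap, w0, w1, kernel)
```

Both attention maps lie strictly between 0 and 1, so the module rescales features rather than creating new ones, and the output keeps the shape of the input.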

Therefore, we replaced the CBL component in the original bottom-up semantic layers of YOLO-V4 with a CBAM component, which effectively reduced the model parameters and increased the weight, in feature maps at different scales, of the target regions to be identified. At the same time, we again drew on the residual structure of ResNet in each semantic layer and fused, pixel by pixel, the shallow feature map output by the backbone network with the deep feature map obtained after multilayer convolution to enrich the feature maps. The AtFPN designed in this paper is shown in Figure 6.

2.4. ShipYOLO

In summary, this paper designs ShipYOLO, a detection model better suited to ship detection; the model structure is shown in Figure 7. Firstly, an efficient backbone network, RCSPDarknet, is designed using the structure reparameterization technology, which addresses the current problem of low real-time performance in ship detection. Secondly, the feature enhancement module DSPP is designed using dilated convolution and Max-Pooling and is combined with AtFPN, the feature pyramid structure based on the attention mechanism. While preserving the inference speed, this combination further improves model accuracy and addresses the missed detections of small-scale ships found in current ship detection models.

3. Experiments and Results

3.1. Evaluating Indicator

This paper uses mAP as the model’s accuracy evaluation indicator, where mAP@0.5:0.95 represents the average mAP over different IoU thresholds (from 0.5 to 0.95 with a step size of 0.05). mAP50 and mAP90 represent the mAP at IoU thresholds of 0.5 and 0.9, respectively. mAP (small) represents the average mAP for small objects. FPS represents the number of frames processed per second. #Params represents the number of parameters of the model. For a convolutional layer, the calculation formula is

$$\mathrm{Params}_{\mathrm{conv}} = C_o \times (k_w \times k_h \times C_i + 1),$$

where $C_o$ is the number of output channels, $C_i$ is the number of input channels, $k_w$ is the width of the convolutional kernel, $k_h$ is the length of the convolutional kernel, and 1 is the bias of the convolutional kernel.

For a fully connected layer, the calculation formula is

$$\mathrm{Params}_{\mathrm{fc}} = m \times (n + 1),$$

where $m$ is the output vector dimension of the fully connected layer, $n$ is the input vector dimension of the fully connected layer, and 1 is the bias of the fully connected layer.
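The two counting rules described above, each with one bias per output channel or output unit, can be written as small helpers. The function names and the example layer sizes are illustrative only and do not correspond to specific layers of ShipYOLO.

```python
def conv_params(c_out, c_in, k_w, k_h):
    """Parameters of a convolutional layer: one k_w x k_h x c_in kernel plus one bias per output channel."""
    return c_out * (k_w * k_h * c_in + 1)

def fc_params(m, n):
    """Parameters of a fully connected layer: n weights plus one bias per output unit."""
    return m * (n + 1)

# Hypothetical layers: a 3x3 convolution from 32 to 64 channels
# and a fully connected layer from 512 inputs to 10 outputs.
print(conv_params(64, 32, 3, 3))  # 18496
print(fc_params(10, 512))         # 5130
```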

3.2. Dataset
3.2.1. Public Ship Dataset

In 2018, Shao et al. created a public ship dataset (SeaShips) [34], which currently consists of 31,455 pictures, covering 6 common ship types (ore carrier, bulk cargo carrier, general cargo ship, container ship, fishing boat, and passenger ship). Part of the data is shown in Figure 8. For this dataset, we divided it according to the ratio of 8 : 2 and produced a training set and a validation set.

3.2.2. Self-Built Ship Dataset

To cover a richer range of scenes, we built a ship dataset from natural scenes containing 2238 images in a single category (Ship); part of the dataset is shown in Figure 9. As with SeaShips, we divided it at a ratio of 8 : 2 into a training set and a validation set.
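An 8 : 2 split of this kind can be sketched as follows. The paper does not state how images were assigned to the two sets, so the seeded random shuffle, the rounding down of the training count, and the file names below are all assumptions made for illustration.

```python
import random

def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle with a fixed seed and cut the list at the training ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_ratio)
    return items[:n_train], items[n_train:]

# Hypothetical file names standing in for the 2238 self-built images.
image_ids = [f"ship_{i:04d}.jpg" for i in range(2238)]
train, val = split_dataset(image_ids)
print(len(train), len(val))  # 1790 448
```

Fixing the seed keeps the split reproducible, so that different models are compared on the same validation images.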

3.3. Experiment and Result

We conducted experiments in a 1080Ti environment. First, we evaluated each of our three optimization strategies on top of YOLO-V4, using our self-built dataset with an input size of 512 × 512. The experimental results are shown in Table 1.

From Table 1, we can see that RCSPDarknet significantly improves the inference speed of the model and reduces the number of parameters while maintaining accuracy. Embedding the DSPP module into YOLO-V4 effectively improves the detection accuracy of the model, but the inference speed is slower than that of YOLO-V4. Finally, YOLO-V4 based on AtFPN has fewer model parameters, and its detection accuracy and inference speed are not adversely affected.

Finally, we compared the performance of ShipYOLO, YOLO-V4, and YOLO-V3 on the two datasets and tested the detection accuracy of the three models on small targets. The experimental results are shown in Tables 2–4.

From Tables 2 and 3, we can see that YOLO-V4 has better detection accuracy than YOLO-V3 at input sizes of 416 × 416 and 512 × 512, while YOLO-V3 has better detection accuracy than YOLO-V4 at an input size of 320 × 320, although YOLO-V4 has a faster inference speed. The comparison shows that the ship detection model ShipYOLO proposed in this paper is better than YOLO-V4 and YOLO-V3 in accuracy, FPS, and #Params. With an input size of 320 × 320, compared with YOLO-V4, ShipYOLO achieves a 13.6% increase in mAP@0.5:0.95 (10.6% in mAP90) and a 23.7% increase in FPS, with the model #Params reduced to 188 m. Table 4 also shows that ShipYOLO detects small-scale ships better.

We also selected some typical pictures for verification. As shown in Figures 10 and 11, ShipYOLO resolves the missed detections of small-scale ships seen with YOLO-V4 and YOLO-V3 and regresses bounding boxes better. The comparison in Figures 10 and 11 and the experimental data in Tables 1 and 2 demonstrate the effectiveness of ShipYOLO in the field of ship detection. It is a faster and more accurate ship detection model.

4. Conclusions

This paper proposes an enhanced model based on YOLO-V4 for ship detection. First, this paper uses structured reparameterization technology to optimize the backbone network. The new backbone network increases the model’s inference speed and addresses the poor speed of ship detection models. Secondly, this paper redesigns the receptive field amplification module of YOLO-V4 and optimizes the feature pyramid structure based on the attention mechanism. These structures improve the model’s detection of small-scale ships and address missed detections of small-scale ships. Extensive experimental results show that our detection model ShipYOLO significantly improves speed and accuracy compared with current advanced detection models and can be effectively applied to ship detection. Through experiments, we have also found that extreme weather conditions during navigation, such as fog and rain, seriously affect the model’s recognition of ships. Therefore, we will do more research in this direction so that ships can be effectively identified in more complex environments.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the Fundamental Research Funds for the Central Universities under Grant 3132019400.