1 Introduction

In recent years, the demand for automated surveillance systems has grown rapidly. This is mainly due to the continuously decreasing cost of cameras and sensors, which makes video material broadly available, and to the inefficiency and high labor cost of having humans process this enormous amount of data. To alleviate this, numerous algorithms have been proposed to automate the analysis of video surveillance material and the subsequent alerting for specific events and dangerous situations. The automated analysis of video surveillance involves the automated detection of objects and their classification. These objects-of-interest include people, vehicles and maritime vessels. However, regardless of the target entity, creating continuous visual coverage of a large physical space is not feasible. In other words, the camera fields-of-view are inevitably sparse compared to the full area where objects-of-interest may be present. As a result, intelligent surveillance systems need to find the correspondences between the identities of objects within multiple disjoint camera views. This makes the re-identification (re-ID) of specific objects and their identities one of the most important steps to achieve fully-automated surveillance and higher levels of event analysis.

Today, most security applications benefit from revealing the motion history of objects across disjoint camera views. This valuable information can be used to search for an object-of-interest in a larger database, to detect trajectory-based anomalies and to reveal inter-camera trajectories for observation, surveillance and flow statistics. In recent years, due to its prominent use-cases, the problem of re-ID has increasingly attracted scientific attention. Starting from the area of person re-ID, the research has been extended to cover the problems of vehicle and maritime vessel re-ID, where different environments and added difficulties render the re-deployment of the person re-ID algorithms ineffective. As a result, specialized approaches to maritime vessel re-ID are required. Since maritime vessels are infamous tools for transporting illegal goods [1], piracy [27] and illegal fishing [7, 30], an automated system for maritime vessel surveillance would be highly attractive for law enforcement.

Similar to person and vehicle re-ID in traffic on public roads, maritime vessel re-ID is a difficult problem. Difficulties inherent to the re-ID task, such as occlusions, viewpoint variations, low-resolution images, target object similarities and variable lighting/weather conditions, also occur in maritime vessel re-ID. Furthermore, there are additional difficulties that are specific to maritime vessel re-ID. For example, the large variability of the aspect ratio with changing pose and the large distances between the camera and the target vessel introduce new challenges. However, thanks to modern computer vision techniques, faster hardware and deep learning, developing a practical, real-time re-ID system for automated surveillance is now possible.

In order to take advantage of deep learning algorithms, a large amount of labeled data is needed. However, to the best of our knowledge, only a few medium-scale datasets are publicly available to researchers for the problem of maritime vessel re-ID. To address this issue, we present a new maritime vessel re-ID dataset in this study to promote further research. Our new dataset is named VR-VCA (Vessel Re-identification-Video Coding and Architectures Research Group). It includes 729 unique maritime vessel identities, pre-labeled for their vessel-type, orientation and identity. A total of 4,614 images are available in the dataset.

In order to assess the difficulty level of our dataset and to provide a baseline for further studies, we evaluate 6 different architectures with 2 different loss combinations and training settings. This quantitative evaluation of existing approaches reveals efficient training and network design strategies for vessel re-ID and provides further studies with useful basic design principles.

Lately, multi-branch architectures have gained importance in the fields of person and vehicle re-ID. State-of-the-art algorithms are now using spatially and/or channel-wise partitioned and pooled feature maps and train those individual branches with separate losses. Such an approach is shown to improve the re-ID accuracy significantly, since it takes advantage of combining the local features with global features in an end-to-end trained, deep learning framework. Motivated by the superior performance of branched networks in multiple image retrieval fields, we propose a new architecture that is carefully designed to address the problem of vessel re-ID. We summarize and present our contributions below.

  • Large-scale maritime vessel re-ID dataset with annotated vessel orientation and vessel-type labels has been collected and created. This dataset is made publicly available (see Footnote 1).

  • Comprehensive performance evaluation of deep learning architectures on the new dataset, to constitute a strong baseline.

  • Novel, multi-branch architecture that is carefully designed to solve the difficulties of the maritime vessel re-ID problem.

The remainder of this paper is structured as follows. Section 2 reviews the person, vehicle and maritime vessel re-ID methods from literature. In Sect. 3, we describe the proposed multi-branch deep learning method in detail. Section 4 introduces our new dataset and presents statistical information on its content. Section 5 discusses a quantitative performance evaluation of our method and selected baseline architectures. Lastly, in Sect. 6, concluding remarks are given.

2 Related work

To review the existing approaches to maritime vessel re-ID, we first briefly discuss the person and vehicle re-ID algorithms.

2.1 Person and vehicle re-identification

Person re-identification Person re-ID is the task of matching the identities of people across non-overlapping camera views. To the best of our knowledge, the problem of re-ID in general was first introduced in [34], in the context of person re-ID. In this work, the authors assign a latent label to each person and define a probabilistic relation between the labels and features. Then, the re-ID problem is solved by finding the posterior label distributions. Following [34], the field received immense scientific attention. In most studies, a fixed set of features is extracted from each of the person images, followed by the calculation of a mathematical measure of distance for each feature pair. Then, from the database of known people (gallery), likely matches that have small distances to an image under consideration (query) are retrieved. This methodology forms two main directions for person re-ID research. On one hand, researchers focused on developing better feature extraction approaches that are suitable for the problem. On the other hand, studies aimed to develop better distance metrics to yield better ranking.

In early feature extraction studies, a wide variety of different handcrafted features are proposed to address the problem. In [10], the authors use spatiotemporal over-segmentation to determine viewpoint invariant regions. Then, they compute a feature vector that uses color and structural information. In [2], Bazzani et al. introduce Symmetry-Driven Accumulation of Local Features (SDALF). SDALF features are extracted by first applying foreground-background segmentation on person bounding boxes, followed by a silhouette partitioning that identifies salient regions. Finally, the color and texture features that are extracted from each salient region are combined into a single feature vector. In [6], the authors apply a detector based on Histogram of Oriented Gradients (HOG) to locate the full body and other semantically meaningful partitions such as the top, the torso, legs, the left arm and the right arm of each pedestrian. Then, the covariance descriptor is extracted from each region, considering the position, color and gradients. Lastly, authors employ a pyramid matching scheme with multi-granularity features to compute distances between the detected people. In [16] and [12], scale-invariant feature transform (SIFT), and its modification, the speeded-up robust features (SURF) are used to characterize the person bounding boxes. Other popular feature extraction methods that have been used to solve the problem of person re-ID include maximally stable color regions (MSCR) [25], local binary patterns (LBP) [15] and local maximal occurrence (LOMO) [23].

Besides the feature extraction-based approaches to person re-ID, metric learning methods have also received immense scientific attention. In such methods, the aim is to find a transformation of the feature space, such that the transformed feature space has better separation of different identity clusters. Such approaches typically attempt to minimize the intra-class variance of each identity, while maximizing the separation of different identities. For instance, in [20], the authors derive a metric learning method by exploiting an equivalence constraint in an efficient formulation. In [37], Zheng et al. introduce the probabilistic relative distance comparison (PRDC) model, which aims to maximize the probability that a matching pair has a smaller distance than a non-matching pair. In [28], the authors reformulate the challenge as a ranking problem and learn a subspace where the potential true match is assigned the highest ranking. This approach effectively transforms the problem into a relative ranking problem, instead of an absolute scoring problem. In XQDA [23], Liao et al. propose to learn a discriminant low-dimensional subspace by cross-view quadratic discriminant analysis.

Following the emergence of deep learning in 2012, most of the recent person re-ID research now utilizes deep models to solve the problem. Learning feature extraction and suitable distance metrics simultaneously from the available large-scale datasets, deep learning solutions to the person re-ID provide high accuracy and reasonable computational cost thanks to powerful modern hardware. This opens up new possibilities for the practical applications of re-ID, especially in the field of surveillance.

To take advantage of the potential of deep learning, numerous methods have been proposed. Hermans et al. [14] introduce a mini-batch construction strategy that is tailored to the use of triplet loss. The authors propose to use only the hardest positive and negative sample for a given anchor image in a carefully sampled mini-batch to improve the performance. In [22], the authors propose a harmonious attention module that determines the regions-of-interest of a given person bounding-box sample. In [5], the authors derive a new loss function called the quadruplet loss. In this method, three images for each anchor image are sampled from the dataset, two of which are negative (different identity). In [38], the authors propose a generative adversarial network (GAN) to enhance the training dataset with artificially generated data. In [17], Kalayeh et al. use a two-stream deep architecture, where one of the streams generates masks for semantically meaningful body partitions, and the other extracts features. Then, the final feature vector for a given bounding-box image is constructed by combining the feature vectors for each semantic region. In [31], Su et al. take advantage of the pose information to enhance the performance of re-ID. In this method, the authors utilize the pose estimations to partition the bounding-box image, and learn robust global and local feature representations. In [32], the authors use triplet and softmax losses in a multi-branch architecture called MGN. In this method, each branch divides the intermediate feature volume into multiple volumes before collapsing the spatial dimensions with pooling. Then, each divided part is trained with an independent loss to yield better feature extraction. In [26], the authors first extract convolutional neural network (CNN) features for each person bounding box in a time sequence. Then, a recurrent neural network (RNN) is used to combine the feature vectors of individual frames into one feature vector to be used for re-ID. Numerous other person re-ID methods have been proposed. For further reading on the problem of person re-ID, the reader is referred to the surveys in [3] and [36].

Vehicle re-identification The problem of Vehicle re-ID is the task of identity matching of vehicles across disjoint cameras. Recently, this problem attracted increasing scientific attention, due to its valuable applications in the fields of surveillance and traffic-flow analysis. The vehicle re-ID problem includes additional challenges, such as motion blur, varying aspect ratio of bounding boxes, reflective surfaces of vehicles and only subtle differences between different identities with similar model/make/year.

Although the vehicle re-ID is a relatively new problem compared to its person variant, since research could take advantage of the already mature person re-ID literature, the performance has grown significantly in a short time. For instance, in [33], authors generate orientation-based region proposals to refine the global CNN features with respect to the viewpoint. In [35], Zapletal et al. first extract 3D bounding-box information of a given bounding-box image of a vehicle. Then, the image is normalized by mapping different visible sides of the vehicle into a fixed spatial location. In [24], authors introduce a multi-branch architecture called region-aware deep model (RAM). In this architecture, multiple branches aim to extract better features by using different strategies, such as spatial feature volume partitioning, attribute learning and batch normalization. Similarly, in [4], authors propose a two-branch architecture that, in addition to spatial partitioning, employs partitioning of intermediate feature volumes in the channel dimension. In [21], authors discuss various mini-batch sampling strategies for the triplet loss. Further, Kumar et al. also comparatively evaluate the contrastive and triplet losses. For a detailed review of vehicle re-ID methods, the reader is kindly referred to [18].

Maritime vessel re-identification Compared to its person and vehicle variants, maritime vessel re-ID is a relatively new problem. In addition to the already challenging problems of person and vehicle re-ID, maritime vessel re-ID introduces additional challenges, such as low-resolution images due to the size of the vessel and imaging distance, and high variability of the bounding-box aspect ratios with the viewpoint.

Up to this point, the maritime vessel re-ID problem has received only a fraction of the scientific attention given to its person and vehicle variants. In [9], the authors propose an architecture called the identity-oriented re-identification network that combines the triplet loss and softmax cross-entropy (CE) loss with a ResNet50 [13] architecture. In [11], the authors base their work on [14] and extend the method with various multi-query strategies. In [29], the authors introduce a new dataset, as well as a novel approach that employs global-and-local fusion-based discriminative feature learning. This method combines CE loss with a novel, orientation-guided quintuplet loss and performs multi-view representation learning for re-ID.

We conjecture that the field of maritime vessel re-ID suffers from the lack of a widely-adopted, large-scale dataset. To alleviate this, we introduce a new dataset called VR-VCA. In accordance with the current trends in re-ID literature, we also provide the viewpoint and vessel-type labels for each sample to promote further research. The detailed information and baseline analysis of our dataset is included in Sect. 4. Further, we also propose a novel, deep learning-based, multi-branch architecture in Sect. 3.

3 Maritime vessel re-identification

This section presents the proposed MVR-net method. In the following subsections, we provide an overview of the proposed method, after which we explain each element of the architecture in detail.

3.1 Architecture overview

Figure 1 illustrates the architecture of the MVR-net. The proposed network receives a mini-batch of labeled vessel images as input. Then, it extracts a feature embedding for the input images using a backbone feature-extraction network. Afterwards, the embedding is passed through three parallel convolutional branches. Each of these branches is carefully designed to make the extracted feature embedding more discriminative and to generate a more indicative representation of the input images in feature space. Branches of this kind typically specialize in processing height, width, or channel information. This type of architecture is inspired by recent multi-branch methods like MGN [32] and PRN [4], which show significant improvement in re-identification performance for pedestrians and vehicles, respectively. The MGN network is an example that uses height as a guiding discriminating feature, while PRN employs height, width, and channel as discriminating features. The proposed MVR-net also exploits those three branches and uses a combination of a triplet and a softmax loss function to calculate the gradients required in the training procedure.

Fig. 1 Architecture of the proposed MVR-net

3.2 MVR-net description

This subsection explains each part of the proposed MVR-net in detail, as illustrated in Fig. 1.

Pre-processing of input images This work uses the VR-VCA dataset for training. In this dataset, the majority of vessel samples are captured with cameras deployed on shorelines and therefore possess horizontally oriented structures. For this reason, the MVR-net first downsamples the training images to a size of \(128 \times 384\) pixels (height \(\times \) width). Then, it applies conventional standardization and augmentation techniques (e.g. normalization, random horizontal flipping, and random erasing [40]) to the resized input images. At the beginning of each training epoch, MVR-net randomly picks K pre-processed samples per individual vessel to generate an input batch. The pre-processing operations on input images and the batch generation are not depicted in the network architecture diagram.
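The pre-processing steps above can be sketched in a few lines. The following is only a minimal illustration on a single-channel image stored as nested lists; the nearest-neighbor resizing, the mean and standard deviation, the probabilities, the simplified square erasing region and all function names are assumptions made for this sketch and are not taken from the MVR-net implementation.

```python
import random

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize of a [H][W] image (assumed interpolation)."""
    in_h, in_w = len(img), len(img[0])
    return [[img[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]

def normalize(img, mean, std):
    """Standardize every pixel: (value - mean) / std."""
    return [[(v - mean) / std for v in row] for row in img]

def random_hflip(img, rng, p=0.5):
    """Mirror the image left-to-right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in img]
    return [row[:] for row in img]

def random_erase(img, rng, p=0.5, min_frac=0.02, max_frac=0.2):
    """Overwrite one random, roughly square region with random values.

    Simplified variant of random erasing: the aspect ratio is not sampled.
    """
    out = [row[:] for row in img]
    if rng.random() >= p:
        return out
    h, w = len(out), len(out[0])
    area = rng.uniform(min_frac, max_frac) * h * w
    eh = max(1, min(h, int(round(area ** 0.5))))
    ew = max(1, min(w, int(round(area / eh))))
    top = rng.randrange(h - eh + 1)
    left = rng.randrange(w - ew + 1)
    for y in range(top, top + eh):
        for x in range(left, left + ew):
            out[y][x] = rng.random()
    return out

def preprocess(img, rng, mean=0.5, std=0.25):
    """Resize to 128 x 384 (height x width), then normalize, flip and erase."""
    img = resize_nearest(img, 128, 384)
    img = normalize(img, mean, std)
    img = random_hflip(img, rng)
    return random_erase(img, rng)

# Usage on a synthetic 200 x 500 image.
rng = random.Random(0)
raw = [[((x + y) % 256) / 255.0 for x in range(500)] for y in range(200)]
sample = preprocess(raw, rng)
print(len(sample), len(sample[0]))
```

In practice, such a pipeline would normally be built from the transforms of a deep learning framework; the sketch only makes the order of operations and the target resolution explicit.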

Mini-batch generation To benefit from the fast training and gradient smoothness of mini-batch training, we employ mini-batch gradient training with the batch-hard sampling strategy [14]. For this, the second step in each epoch is to divide the generated batch into mini-batches of N samples. Since every identity contributes K samples, each mini-batch includes P = N/K unique IDs. This form of mini-batch generation is adopted, since our network uses triplet loss with batch-hard mining as part of its total cost function. Finally, all mini-batches are supplied into the backbone network for training iterations.
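The relation between N, K and the number of identities per mini-batch can be illustrated with a small sampler. This is a sketch under stated assumptions: the function name, sampling with replacement for identities that have fewer than K images, and dropping leftover identities that do not fill a complete mini-batch are choices made here for illustration only.

```python
import random

def make_minibatches(samples_by_id, K, N, rng):
    """Build mini-batches of N samples with K samples for each of P = N // K identities."""
    assert N % K == 0, "N must be a multiple of K"
    P = N // K
    ids = list(samples_by_id)
    rng.shuffle(ids)
    minibatches = []
    # Leftover identities that cannot fill a full mini-batch are skipped (assumption).
    for start in range(0, len(ids) - P + 1, P):
        batch = []
        for vessel_id in ids[start:start + P]:
            pool = samples_by_id[vessel_id]
            if len(pool) >= K:
                picks = rng.sample(pool, K)
            else:
                # Too few images for this identity: sample with replacement (assumption).
                picks = [rng.choice(pool) for _ in range(K)]
            batch.extend((vessel_id, name) for name in picks)
        minibatches.append(batch)
    return minibatches

# Usage with hypothetical file names: 10 identities with 2 to 4 images each.
data = {i: [f"vessel{i}_{j}.jpg" for j in range(2 + i % 3)] for i in range(10)}
batches = make_minibatches(data, K=4, N=8, rng=random.Random(0))
print(len(batches), len(batches[0]))
```

With K = 4 and N = 8, every mini-batch contains P = 2 identities, so each anchor always has at least one positive and one negative candidate for the batch-hard triplet loss.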

Backbone network Generally, CNN-based re-identification methods extract features for input samples and further process these features to verify if a query sample belongs to a specific identity from the gallery database, based on feature similarities. To this end, a CNN network is required to extract the feature representations. In order to facilitate the process, an initial set of layers from reliable classification CNNs are frequently employed to generate the coarse portion of the required feature representation. This will be followed by re-identification-oriented layers, designed specifically for the intended application.

One of the most frequently used backbones in the re-identification literature is ResNet50 [13]. This network is well known for its robustness against the vanishing gradients problem. Therefore, we employ ResNet50 as our backbone network to extract the feature maps for each image of the mini-batch. Then, we duplicate all the extracted feature maps at the end of the Conv4_1 layer (see [13] for details) three times, and feed these features separately into our three re-identification branches. In other words, the layers of these branches are constructed on top of separate copies of the Conv4_1 layer of the ResNet50, provided by the same backbone network. We will refer to each of these duplicated sets of feature maps as a feature embedding in the remainder of the paper. Each branch then further processes its embedding, in parallel with the other branches, to generate more meaningful features for vessel re-identification.

Multi-branch design As a core contribution, MVR-net proposes a 3-branch architecture, specified to address the maritime vessel re-identification problem. As illustrated in Fig. 1, these branches are called Height branch, Width branch, and Channel branch, focusing on the height-wise, width-wise, and channel-wise structures available in their input feature embeddings.

The Height branch performs three independent partitioning operations across the height dimension of its input feature embedding. These operations generate one, two, and three separate feature volumes, respectively. Within one partitioning operation, each volume contains the same spatial fraction of the input feature embedding. The Width branch applies the same partitioning operations on the horizontal axis of the spatial dimensions. In both of these branches, the operation of partitioning the input feature embedding into one volume means copying the whole input feature embedding into a separate volume, which aims at maintaining the global features of the embedding for the next steps. Such copied volumes will be referred to as global volumes in the remainder of this paper (e.g. the 3D block at the left in each branch). It is also important to highlight that these two branches spatially partition the input feature embedding into “vertical” and “horizontal” volumes. Finally, the Channel branch partitions its input feature embedding into four volumes of features across the embedding depth.
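The partitioning scheme can be made concrete with a toy embedding. The sketch below stores a feature embedding as nested lists indexed as [channel][height][width] and, for brevity, applies all three branches to the same embedding, whereas in MVR-net each branch works on its own processed copy. The function names and the toy dimensions are assumptions of this illustration; when a dimension is not divisible by the number of parts, the sketch falls back to nearly equal chunks.

```python
def bounds(length, parts):
    """Index ranges that cut `length` into `parts` (nearly) equal chunks."""
    return [(i * length // parts, (i + 1) * length // parts) for i in range(parts)]

def shape(volume):
    return (len(volume), len(volume[0]), len(volume[0][0]))

def partition(feat, axis, parts):
    """Split a [C][H][W] embedding along axis 0 (channel), 1 (height) or 2 (width)."""
    volumes = []
    for lo, hi in bounds(shape(feat)[axis], parts):
        if axis == 0:
            volumes.append(feat[lo:hi])
        elif axis == 1:
            volumes.append([channel[lo:hi] for channel in feat])
        else:
            volumes.append([[row[lo:hi] for row in channel] for channel in feat])
    return volumes

def mvr_partitions(feat):
    """One, two and three volumes per spatial branch, four channel volumes."""
    return {
        "height": [v for n in (1, 2, 3) for v in partition(feat, 1, n)],
        "width": [v for n in (1, 2, 3) for v in partition(feat, 2, n)],
        "channel": partition(feat, 0, 4),
    }

# Toy embedding with C = 8, H = 6, W = 12.
feat = [[[float(c + h + w) for w in range(12)] for h in range(6)] for c in range(8)]
parts = mvr_partitions(feat)
for branch, volumes in parts.items():
    print(branch, [shape(v) for v in volumes])
print("total volumes:", sum(len(v) for v in parts.values()))
```

The first volume of the Height and Width lists is the global volume (a copy of the whole embedding), and the three branches together yield 6 + 6 + 4 = 16 volumes.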

After generating these 16 partitioned volumes out of the input feature embeddings (six in the Height branch, six in the Width branch and four in the Channel branch), each branch passes its feature volumes through a three-phase pipeline to prepare the required feature vectors for the loss calculation block. These three phases are as follows.

Phase A: The generated feature volumes are separately shrunk into feature vectors of size 2048 (for the Height and Width branches) and 512 (for the Channel branch) using global max-pooling operations.

Phase B: \(1 \times 1\) convolutional layers are employed to reduce the vectors constructed in the previous phase to a common size of 256. Additionally, a batch normalization operation is applied to the feature vectors. At this point, separate copies of the two feature vectors resulting from the global volumes of the Height and the Width branches are supplied into a triplet loss block.

Phase C: Each of the obtained 16 vectors is transferred into a separate fully-connected layer. The output dimensions of these fully-connected layers are equal to the number of unique vessel identities in the training set. Then, these outputs are carried into a softmax CE loss block. MVR-net combines the triplet and the softmax CE losses to calculate the gradients required for updating the network parameters during training. These loss functions will be explained in detail in the next part of this subsection.
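The three phases can be traced for a single feature volume with toy dimensions. In the sketch below, 8, 4 and 3 stand in for 2048 (or 512), 256 and the number of training identities; the weights are random placeholders, the batch normalization has no learned scale or shift, and all names are assumptions of this illustration rather than the actual implementation. It relies on the fact that a \(1 \times 1\) convolution applied to a \(1 \times 1\) spatial map is a plain linear map.

```python
import random

def global_max_pool(volume):
    """Phase A: collapse a [C][H][W] volume into a length-C vector."""
    return [max(max(row) for row in channel) for channel in volume]

def linear(vec, weights, bias):
    """Linear map; equivalent to a 1x1 convolution on a 1x1 spatial map."""
    return [sum(w * v for w, v in zip(w_row, vec)) + b
            for w_row, b in zip(weights, bias)]

def batch_norm(batch, eps=1e-5):
    """Normalize each feature dimension over the mini-batch (no learned scale/shift)."""
    n, dims = len(batch), len(batch[0])
    means = [sum(v[d] for v in batch) / n for d in range(dims)]
    variances = [sum((v[d] - means[d]) ** 2 for v in batch) / n for d in range(dims)]
    return [[(v[d] - means[d]) / (variances[d] + eps) ** 0.5 for d in range(dims)]
            for v in batch]

def random_layer(rng, n_out, n_in):
    """Placeholder weights; a trained network would supply these."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
C_IN, C_RED, NUM_IDS, BATCH = 8, 4, 3, 5  # toy stand-ins for the real dimensions
volumes = [[[[rng.random() for _ in range(6)] for _ in range(2)]
            for _ in range(C_IN)] for _ in range(BATCH)]
reduce_w, reduce_b = random_layer(rng, C_RED, C_IN)
fc_w, fc_b = random_layer(rng, NUM_IDS, C_RED)

pooled = [global_max_pool(v) for v in volumes]                        # Phase A
reduced = batch_norm([linear(p, reduce_w, reduce_b) for p in pooled])  # Phase B
logits = [linear(r, fc_w, fc_b) for r in reduced]                      # Phase C
print(len(pooled[0]), len(reduced[0]), len(logits[0]))
```

The vectors in `reduced` correspond to the features that would feed the triplet loss (for global volumes), while `logits` correspond to the inputs of the softmax CE loss.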

Loss function We employ softmax CE loss and triplet loss with the batch-hard sampling strategy as introduced in [14], to train our architecture. The batch-hard triplet loss is defined in Eq. (5) of [14] as:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{\text {BH}}(\theta ; X)&= \sum _{i=1}^{P} \sum _{a=1}^{K} \Bigl [ m+\max \limits _{p=1...K} D\left( f_\theta (x_a^i), f_\theta (x_p^i)\right) \\&\quad - \min _{\begin{array}{c} j=1...P\\ n=1...K\\ j\ne i \end{array}} D\left( f_\theta (x_a^i), f_\theta (x_n^j)\right) \Bigr ]_+, \end{aligned} \end{aligned}$$
(1)

where X, \(\theta \), P, K, m, \(f_\theta \) and D denote an input mini-batch, learned network weights, number of different identities in a mini-batch, number of images per identity in a mini-batch, triplet loss margin, forward inference function and distance calculation function, respectively. In addition to the triplet loss, we use the softmax CE loss for identity classification. Mathematically, softmax CE loss for a mini-batch X can be specified by:

$$\begin{aligned} {\mathcal {L}}_{\text {CE}}(\theta ; X) = -\sum _{x\in X}\log \left( \frac{e^{f_\theta (x)[\text {id}(x)]}}{\sum _je^{f_\theta (x)[j]}}\right) , \end{aligned}$$
(2)

where function id(.) returns the identity of a given image. Combining the two losses, the total loss is expressed as:

$$\begin{aligned} {\mathcal {L}}_{\text {total}}(\theta ; X) = \frac{1}{T}\sum _{t=1}^{T}{\mathcal {L}}_{\text {BH}}^t(\theta ; X) + \frac{\gamma }{C}\sum _{c=1}^{C} {\mathcal {L}}_{\text {CE}}^c(\theta ; X), \end{aligned}$$
(3)

where \(\gamma \), T and C are the loss weighting parameter, the number of triplet losses and the number of softmax CE losses, respectively.
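Equations (1) to (3) can be checked numerically with a small sketch. The code below assumes the Euclidean distance for D, which is one possible choice and not stated above, and uses hypothetical function names. The description of Phase B and Phase C suggests T = 2 triplet terms and C = 16 softmax CE terms for MVR-net, but the sketch accepts any number of terms.

```python
import math

def batch_hard_triplet(embeddings, labels, margin):
    """Eq. (1): for every anchor, hardest positive versus hardest negative."""
    total = 0.0
    for anchor, anchor_label in zip(embeddings, labels):
        pos = [math.dist(anchor, e) for e, l in zip(embeddings, labels) if l == anchor_label]
        neg = [math.dist(anchor, e) for e, l in zip(embeddings, labels) if l != anchor_label]
        # [.]_+ is the hinge; requires at least two identities in the mini-batch.
        total += max(0.0, margin + max(pos) - min(neg))
    return total

def softmax_ce(logits_batch, labels):
    """Eq. (2): summed negative log-softmax of the true identity."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        peak = max(logits)  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_norm - logits[label]
    return total

def total_loss(triplet_terms, ce_terms, gamma):
    """Eq. (3): mean of T triplet terms plus gamma times the mean of C CE terms."""
    return (sum(triplet_terms) / len(triplet_terms)
            + gamma * sum(ce_terms) / len(ce_terms))

# Two identities with two samples each (P = 2, K = 2).
emb = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]]
ids = [0, 0, 1, 1]
print(batch_hard_triplet(emb, ids, margin=0.3))  # identities are well separated
print(batch_hard_triplet(emb, ids, margin=3.0))  # a large margin makes every anchor active
print(softmax_ce([[0.0, 0.0]], [0]))             # equals log(2) for uniform logits
```

For this toy mini-batch, every anchor has a hardest positive at distance 1 and a hardest negative at distance 3, so the loss vanishes for margins up to 2 and grows linearly beyond that.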

3.3 Discussion on MVR-net architecture

Motivation for Height and Width branches As mentioned above, the Height and Width branches of the MVR-net separately partition their input feature embeddings into one, two, and three volumes. This idea of partitioning the feature embeddings into several volumes is inspired by recent state-of-the-art methods. As an example, the well-known PRN network [4] consists of only two branches. These branches partition their input feature embedding just once, into four spatial volumes (one branch vertically and the other branch horizontally). However, this architecture is designed for the vehicle re-identification problem. Surveying the vehicle re-identification datasets shows that car samples share similar image characteristics, even though the cars themselves differ. This is because surveillance cameras capturing cars are installed at fixed locations above roads and capture the cars at similar distances from the camera lenses, yielding car samples with similar image qualities and resolutions. However, in a vessel re-identification dataset (e.g. our VR-VCA dataset) captured in maritime environments, vessels appear in diverse settings of camera viewing angles, distances to the camera, occlusions, etc. For a maritime surveillance system, it is essential to identify the vessels as soon as they enter the camera's field of view, which is important for the security of harbors. Consequently, the resolution of captured vessels in our vessel-oriented dataset varies over a wide range. This variation in input images diminishes the performance of a network that processes the feature embeddings for all input samples using a single, fixed number of partitions (e.g. only four partitions for the PRN). This motivates why our MVR-net uses a broader partitioning to cover the various resolution scenarios. MVR-net implements this by extracting features from several separate partitionings, each with a different number of volumes.

In order to select the optimum number of partitions for each spatial branch, we have trained several versions of the MVR-net with different numbers of partitions (which will be discussed later in the experimental results). After this experimental investigation, we have decided to split each input embedding separately into one, two, and three volumes in each spatial branch. With this, for the vessels located at a close distance to the camera (vessels with higher resolution), the volumes obtained by splitting the input embeddings into three spatial partitions yield more detailed features. Similarly, the two-partition volumes extract useful information for the low-resolution vessels (e.g. those located far away from the surveillance camera). According to our experiments, including volumes that are obtained by splitting the feature embeddings into more than three partitions (like four in PRN) diminishes the re-identification performance for vessels. This occurs most probably because a higher degree of partitioning reduces the influence of coarser features (e.g. the global features) in the final feature representation. Therefore, we conclude that the coarser features have a high impact on the vessel re-identification problem (compared to the vehicle re-identification), especially if the vessel samples lack resolution.

Motivation for channel branch We know from neural style transfer that identifying the correlations between the outputs of different filters of a convolutional layer can help to capture the style shared by a set of images (e.g. the common style of the Van Gogh paintings). This concept is also adopted in re-identification methods. To the best of our knowledge, PRN is the first to utilize these correlations by applying channel-wise partitioning on the feature embeddings. According to the original PRN paper [4], the authors target the extraction of distinct local features by these channel-wise operations. However, in PRN these channel-wise partitionings are performed together with the spatial partitionings in the same branches. Each branch of PRN splits the input embedding into one and four spatial volumes (one branch vertically and the other one horizontally) and four channel-wise volumes. In contrast, we prefer to design a network with three branches: the first for detecting horizontal structures inside the input images (Height branch), the second for detecting the vertical structures (Width branch), and the third for detecting the internal correlations between different feature maps of the feature embedding (Channel branch).

4 Maritime vessel re-identification dataset: VR-VCA

In order to train the vessel re-identification model, we have recorded several videos at different times of the day and year at various locations in the Netherlands. We have used two different cameras in our recordings. The videos contain a vast variety of viewpoints on vessels. Additionally, several vessel types with divergent sizes and distances to the cameras are represented in this dataset. Finally, challenging scenarios including vessel occlusion/truncation are also annotated. Figures 2 and 3 illustrate several examples of the VR-VCA dataset.

Fig. 2 Four VR-VCA examples. Each row presents two samples of an individual vessel in different locations

Fig. 3 Another four VR-VCA examples. Each row presents two samples of an individual vessel in different locations

Fig. 4 VR-VCA representation in terms of: a vessel types, and b vessel orientations

The dataset contains a total of 4,614 vessel samples from 729 unique vessel identities. Each vessel identity is represented by several samples. Additionally, we have labeled each vessel with a bounding box, its vessel type, and vessel orientation (i.e. the approximate camera viewing angle) to facilitate future research (see Footnote 2). Vessel types include the following eight classes: sailing vessel, passenger ship, fishing vessel, river cargo, small boat, yacht, tug, and taxi vessel. The vessel orientations are described with the following five orientation labels: front view, front-side view, side view, back-side view, and back view. Figure 4 shows the distribution of the VR-VCA samples over vessel types and orientations. Besides, all samples of a specific vessel share the same unique ID. Moreover, a label is assigned to each cropped sample showing whether the vessel is captured in its full body, or is truncated, or occluded. The dataset is split into training, gallery, and query datasets. The specifications of these datasets are as follows:

Training dataset The training dataset contains 2,268 samples from 365 individual vessels. This dataset includes 184 sailing vessels, 442 passenger ships, 30 fishing vessels, 936 river cargos, 105 small boats, 22 yachts, 64 tugs, and 485 taxi vessels. These samples cover 170 vessels from front view, 727 vessels from front-side view, 736 vessels from side view, 556 vessels from back-side view, and 79 vessels from back view. The maximum and the minimum number of samples per unique ID are 38 and 2, respectively.

Gallery dataset The gallery dataset comprises 1,667 samples from 364 unique vessel identities. The type statistics of the gallery dataset samples are as follows: 144 sailing vessels, 379 passenger ships, 13 fishing vessels, 722 river cargos, 57 small boats, 9 yachts, 29 tugs, and 314 taxi vessels. In this dataset, there are 124 front views, 569 front-side views, 498 side views, 412 back-side views, and 64 back views of vessels. The maximum and the minimum number of samples per unique ID are 26 and 1, respectively.

Query dataset The query dataset contains 679 samples from 364 unique vessel identities. This dataset includes 68 sailing vessels, 152 passenger ships, 5 fishing vessels, 260 river cargos, 26 small boats, 5 yachts, 13 tugs, and 150 taxi vessels. These samples represent 32 vessels with front views, 157 vessels with front-side views, 243 vessels with side views, 213 vessels with back-side views, and 34 vessels with back views. In the query dataset, the maximum and the minimum number of samples per unique ID are 7 and 1, respectively.
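As a quick sanity check on the statistics above, the per-type and per-orientation counts of each split can be summed and compared against the reported totals. The sketch below uses only numbers quoted in this section; the dictionary layout and variable names are our own, and the total identity count assumes that the query and gallery datasets share the same 364 identities (which is what the overall figure of 729 identities suggests).

```python
# Counts copied from the text; the data layout itself is illustrative only.
TYPES = ["sailing vessel", "passenger ship", "fishing vessel", "river cargo",
         "small boat", "yacht", "tug", "taxi vessel"]
ORIENTATIONS = ["front", "front-side", "side", "back-side", "back"]

splits = {
    "training": {"samples": 2268, "ids": 365,
                 "types": [184, 442, 30, 936, 105, 22, 64, 485],
                 "orientations": [170, 727, 736, 556, 79]},
    "gallery": {"samples": 1667, "ids": 364,
                "types": [144, 379, 13, 722, 57, 9, 29, 314],
                "orientations": [124, 569, 498, 412, 64]},
    "query": {"samples": 679, "ids": 364,
              "types": [68, 152, 5, 260, 26, 5, 13, 150],
              "orientations": [32, 157, 243, 213, 34]},
}

for name, s in splits.items():
    # Every sample carries exactly one type label and one orientation label.
    assert len(s["types"]) == len(TYPES)
    assert len(s["orientations"]) == len(ORIENTATIONS)
    assert sum(s["types"]) == s["samples"], name
    assert sum(s["orientations"]) == s["samples"], name

total_samples = sum(s["samples"] for s in splits.values())
# Assumption: query and gallery cover the same test identities, counted once.
total_ids = splits["training"]["ids"] + splits["gallery"]["ids"]
print(total_samples, total_ids)
```

Under this assumption, the three splits add up to the 4,614 samples and 729 identities reported for the complete dataset.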

4.1 Discussion on VR-VCA characteristics

There is a fundamental difference between a vehicle re-identification dataset and our vessel re-identification dataset. In vehicle re-identification, cameras are installed at fixed locations above or beside roads, each covering a specific background. Cars have no option but to pass through the fields of view of these cameras. Under such a setting, each car sample is also assigned a camera ID. When performing re-identification for a query sample (i.e. ranking the gallery samples by their similarity to the query sample), it is common to discard the gallery images that show the same car from the same camera as the query sample. The reason is that the query sample should be captured at a different camera location than the candidate gallery images. Otherwise, there is no need for a complex re-identification system and the task can be performed by a simple tracker. However, in a maritime environment, vessels move in arbitrary directions, which makes the fixed-camera option practically infeasible. This problem becomes even more challenging if the maritime environment is a spacious harbor with several exit areas where vessels can maneuver easily. Such cases occur rather frequently in our dataset. Therefore, we have recorded our images by continuously following the moving vessels. In this way, we have captured each individual vessel from different perspectives and against different backgrounds. Hence, we have discarded the camera ID concept in our dataset, since all our samples have different camera and background settings. In other words, we consider every sample to virtually possess its own camera ID. Some samples may still have similar backgrounds, because vessels with the same identity are always captured in the same neighborhood.
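The difference between the two evaluation protocols can be sketched as follows. This is a minimal illustration, not the evaluation code used for the experiments: the sample layout (dictionaries with "feat", "id" and "cam" keys), the function name and the toy feature values are all hypothetical, and the Euclidean distance is used as the similarity measure.

```python
import math

def rank_gallery(query, gallery, use_camera_ids):
    """Return gallery indices sorted by ascending Euclidean distance to the query.

    Each sample is a dict with keys "feat", "id" and "cam" (hypothetical layout).
    With use_camera_ids=True, gallery samples that share both identity and
    camera with the query are discarded first, as is common in vehicle re-ID.
    """
    candidates = []
    for idx, g in enumerate(gallery):
        same_id_same_cam = g["id"] == query["id"] and g["cam"] == query["cam"]
        if use_camera_ids and same_id_same_cam:
            continue
        candidates.append((math.dist(query["feat"], g["feat"]), idx))
    return [idx for _, idx in sorted(candidates)]

query = {"feat": [0.0, 0.0], "id": 7, "cam": 1}
gallery = [
    {"feat": [0.1, 0.0], "id": 7, "cam": 1},  # same vessel, same camera
    {"feat": [0.5, 0.0], "id": 7, "cam": 2},  # same vessel, other camera
    {"feat": [0.3, 0.0], "id": 9, "cam": 1},  # different vessel
]

vehicle_style = rank_gallery(query, gallery, use_camera_ids=True)
vessel_style = rank_gallery(query, gallery, use_camera_ids=False)
print(vehicle_style, vessel_style)
```

With camera IDs, the first gallery sample is removed before ranking. Without them, as in VR-VCA, every gallery sample remains a candidate.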

Besides this difference to vehicle re-identification datasets, there is a resemblance too. In vehicle re-identification, there is always a possibility that cars with similar appearances (model, color, etc.) pass through the views of the same cameras. Therefore, a vehicle re-identification dataset generally contains samples with different identities but almost the same appearance (although these samples may differ slightly because of small stickers, human passengers or other details). This scenario becomes even more challenging in our vessel re-identification dataset, which is captured in city/river-type harbors. Only a limited number of vessel models appear in such environments, and consequently the probability and frequency of finding vessels with the same appearance is much higher than for cars. For example, in one considered scenario, a harbor is located next to a wide river passing through a city, and passenger vessels of the same model belonging to a specific company continuously transfer people across the river. Thus, another aspect that makes our dataset challenging is the appearance-wise overlap between vessel samples of different identities.

Last but not least, the vessel samples of the VR-VCA dataset are cropped from outdoor surveillance images and have different sizes and aspect ratios. Therefore, CNN-based re-ID systems need to resize them to a fixed size in a data pre-processing stage (based on system requirements and architectural design). However, vessels appear with more divergent aspect ratios in image frames than other conventional re-ID targets (e.g. pedestrians with mostly vertical and cars with mostly square-shaped bounding boxes). Thus, the vessel samples of the VR-VCA dataset vary from very narrow yet long samples (horizontally or vertically) to quite square-shaped ones, depending on their types and orientations. This requires a decision on the proper size of the input samples, making VR-VCA an even more challenging dataset. This challenge may not hold for other vessel re-identification datasets, since we have deliberately recorded the videos such that vessels appear in a wide range of resolutions and aspect ratios (by also covering vessels at far-away locations and from different viewpoints).
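To make the effect of a fixed input size concrete, the following sketch computes how much the width-to-height ratio of a crop changes when it is resized. The target size of 384 pixels wide by 128 pixels high is our assumption about how the \(128\times 384\) input used in the experiments is oriented, and the crop sizes are invented for illustration.

```python
def aspect_distortion(width, height, target_width=384, target_height=128):
    """Factor by which the width-to-height ratio changes after resizing.

    1.0 means the crop keeps its shape; values above 1.0 mean it is stretched
    horizontally, values below 1.0 mean it is squeezed horizontally.
    The default target size is an assumed reading of the input size.
    """
    return (target_width / target_height) / (width / height)

# Hypothetical crop sizes (width, height) in pixels.
crops = {
    "long side view": (600, 200),
    "square-ish front view": (200, 200),
    "tall sailing vessel": (100, 300),
}
distortions = {name: aspect_distortion(w, h) for name, (w, h) in crops.items()}
for name, factor in distortions.items():
    print(f"{name}: {factor:.2f}")
```

Under these assumptions, a crop that is three times as wide as it is high keeps its shape, whereas square or vertically elongated crops are stretched considerably.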

5 Empirical validation

This section presents the empirical evaluations and discusses their outcomes in the following subsections.

5.1 VR-VCA performance analysis on baselines

This subsection benchmarks the VR-VCA data by testing several re-ID methods on it. For the comparison, we have adopted models that vary in complexity by having different network architectures and different numbers of network layers. Table 1 presents the performance of the selected models on VR-VCA. The table compares six baseline models with three different losses, both with and without the re-ranking technique. First, the implementation of the deep learning models is explained. Afterwards, the results are analyzed.

Implementation details for baseline methods In order to match each baseline architecture to the vessel re-ID problem, we have adapted the architectures to optimize their performances for this problem. The chosen optimization measures are generic and apply to all baseline architectures, so that each baseline architecture is modified in the same way for a fair comparison.

The following measures have been implemented. (1) We have replaced the fully-connected classifier and pooling layers at the end of each network with a max-pooling layer that condenses both spatial dimensions. (2) This max-pooling layer produces a fixed-size feature vector, which is then used for the triplet loss training. (3) Additionally, if the softmax CE loss is employed during training, we have added a fully-connected layer at the end that outputs the class probabilities for each vessel identity. (4) We have employed re-ranking to improve the performance of the re-identification. Re-ranking works by post-processing the initial ranking results. Following the common practice in literature, we have used the k-reciprocal encoding re-ranking algorithm [39]. For a given query sample, this algorithm assumes that a correct match in the list of top-\(k_1\) retrieved samples is likely to retrieve the query itself within its own top-\(k_2\) positions when it is used as a query to the re-identification module (\(k_2 < k_1\)). Thus, the re-ranking step computes a derived distance metric by separately analyzing the k-reciprocal neighbors of all samples and computing their Jaccard distance. The final distances are then computed as a weighted sum of the original and Jaccard distances, where the individual contributions are weighted by \(\lambda \) and \(1-\lambda \), respectively. (5) Our ImageNet pre-trained baseline architectures are trained with the Adam optimizer [19] for 25 epochs with an initial learning rate of \(3\times 10^{-4}\), which is reduced to one-tenth of its value after the \(10^{th}\) and \(15^{th}\) epochs. Batch-hard parameters P and K are set to \(P=5\) and \(K=4\), while \(\gamma \) is set to unity where applicable, the weight decay is set to \(5\times 10^{-4}\), and the triplet loss margin m is set to unity.
For data augmentation, random horizontal flipping is employed during training and all images are resized to a fixed size of \(128\times 384\) pixels. During testing, we calculate the features for both the original images and their horizontally flipped versions and average them to compute the final feature vector for each image. The re-ranking parameters \(k_1\), \(k_2\), and \(\lambda \) are set to 20, 6, and 0.3, respectively.
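The core idea of the re-ranking step can be sketched in a strongly simplified form. The code below is not the full algorithm of [39]: it uses a single neighborhood size k, builds plain k-reciprocal neighbor sets, and blends the resulting Jaccard distance with the original distance using \(\lambda \). The second neighborhood size \(k_2\) and the further refinements of the original method are omitted, and all function names and toy values are hypothetical.

```python
def k_nearest(dist, i, k):
    """Indices of the k samples closest to sample i (sample i itself included)."""
    return sorted(range(len(dist)), key=lambda j: dist[i][j])[:k]

def rerank(dist, k, lam):
    """Blend original and Jaccard distances: lam * d + (1 - lam) * d_jaccard.

    dist is a symmetric pairwise distance matrix over query and gallery samples.
    """
    n = len(dist)
    top = [set(k_nearest(dist, i, k)) for i in range(n)]
    # j is a k-reciprocal neighbor of i if each appears in the other's top-k.
    recip = [{j for j in top[i] if i in top[j]} for i in range(n)]
    final = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            union = recip[i] | recip[j]
            overlap = len(recip[i] & recip[j]) / len(union) if union else 0.0
            final[i][j] = lam * dist[i][j] + (1.0 - lam) * (1.0 - overlap)
    return final

# Toy 1-D features forming two clusters: samples 0-2 and samples 3-4.
points = [0.0, 0.1, 0.2, 1.0, 1.1]
dist = [[abs(a - b) for b in points] for a in points]
final = rerank(dist, k=3, lam=0.3)
print(round(final[0][1], 3), round(final[0][3], 3))
```

In this toy case, samples within a cluster share their reciprocal neighbors, so their Jaccard distance is zero and only the down-weighted original distance remains. Samples from different clusters share none and receive the full Jaccard penalty.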

Table 1 Performances of various network architectures, given by percentage scores of mAP, Rank-1 (R-1), Rank-3 (R-3), etc. for ResNet18, ResNet34, ResNet50, ResNet101, ResNet152, DenseNet121, DenseNet161, DenseNet169, DenseNet201, and MobileNet

Result analysis According to Table 1, increasing the number of layers, and thereby the complexity of the deep models, improves the performance. For example, the gain is noticeable when going from ResNet18 RR to ResNet50 RR, where the mAP improves by 2.1%. However, the higher number of layers in the ResNet121 RR results in only a slight increase in mAP compared to the ResNet50 RR. Therefore, it can be concluded that a re-identification network based on ResNet50 can provide a more reliable performance on this dataset. Additionally, according to our experiments, using only a softmax CE loss cannot provide adequate discrimination in feature space. However, combining this loss with the triplet loss improves the performance of the baseline architectures. For example, ResNet50 RR employing only the triplet loss yields a 1.1% lower mAP than the version with the combined loss function. It is also important to mention that employing the re-ranking technique always improves the mAP, while decreasing the rest of the metrics (R-1, R-3, R-5, R-10). Our explanation for the lower rank scores when utilizing the re-ranking technique is the high frequency of vessels with the same appearance but different identities in VR-VCA, as explained in Sect. 4.1.
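For reference, the two kinds of metrics discussed above can be computed as in the following sketch, which also shows on constructed data how the mAP can rise while Rank-1 drops. The function names are our own, and the two rankings are invented for illustration; they are not taken from Table 1.

```python
def average_precision(ranked_ids, query_id):
    """AP of one ranked gallery list: mean precision at each correct match."""
    hits, precisions = 0, []
    for rank, gid in enumerate(ranked_ids, start=1):
        if gid == query_id:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def rank_k_hit(ranked_ids, query_id, k):
    """1 if a correct match appears within the first k positions, else 0."""
    return int(query_id in ranked_ids[:k])

def evaluate(rankings, ks=(1, 3, 5, 10)):
    """rankings: list of (query_id, ranked gallery identity list) pairs."""
    n = len(rankings)
    m_ap = sum(average_precision(r, q) for q, r in rankings) / n
    cmc = {k: sum(rank_k_hit(r, q, k) for q, r in rankings) / n for k in ks}
    return m_ap, cmc

# One query of identity 1 with three correct gallery samples (toy data).
before = [(1, [1, 2, 3, 4, 1, 1])]  # correct match first, the others far down
after = [(1, [2, 1, 1, 1, 3, 4])]   # matches grouped, a look-alike on top

map_before, cmc_before = evaluate(before, ks=(1, 3))
map_after, cmc_after = evaluate(after, ks=(1, 3))
print(round(map_before, 4), cmc_before[1])
print(round(map_after, 4), cmc_after[1])
```

In the second ranking, the correct matches are pulled closer together, which raises the average precision, but a vessel of a different identity with a similar appearance takes the first position, so the Rank-1 hit is lost.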

5.2 Validation of MVR-net

This subsection specifically evaluates the MVR-net performance on our vessel re-identification dataset. To this end, the MVR-net is compared with two state-of-the-art re-identification networks, PRN and MGN. The implementation details are first provided and then the obtained results are analyzed. Finally, two separate topics are discussed: our network design, and the batch-hard sampling strategy utilized by the triplet loss function.

Implementation details of MVR-net The MVR-net architecture is trained with the Adam optimizer for 25 epochs with an initial learning rate of \(2\times 10^{-4}\), which is reduced to one-tenth after the \(15^{th}\) and \(20^{th}\) epochs. Batch-hard parameters P and K are set to \(P=5\) and \(K=4\). Parameter \(\gamma \) is set to \(\gamma =2\), the weight decay is set to \(5\times 10^{-4}\) and the triplet loss margin m is set to unity. For data augmentation, random horizontal flipping and random erasing are employed during training and all images are resized to a fixed size of \(128\times 384\) pixels. During test time, we calculate the features for both the original images and their horizontally flipped versions and average them to compute the final feature vector for each image.
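Two of the settings above can be sketched in a few lines: the step-wise learning-rate schedule and the flip-averaged test-time feature. This is a minimal illustration under our own assumptions. Epochs are taken to be 1-indexed, with the reduction applied from the epoch following each milestone, and `extract` is a hypothetical stand-in for the trained network.

```python
def learning_rate(epoch, base_lr=2e-4, milestones=(15, 20), factor=0.1):
    """Learning rate used during a 1-indexed epoch under a step schedule."""
    drops = sum(1 for m in milestones if epoch > m)
    return base_lr * factor ** drops

def flip_averaged_feature(extract, image):
    """Average the features of an image and its horizontally flipped copy.

    `extract` is any callable mapping an image (a list of pixel rows) to a
    feature vector; it stands in for the trained network.
    """
    flipped = [row[::-1] for row in image]
    original_feat, flipped_feat = extract(image), extract(flipped)
    return [(a + b) / 2 for a, b in zip(original_feat, flipped_feat)]

def toy_extract(image):
    """Toy feature: sums of the first and the last pixel column."""
    return [sum(row[0] for row in image), sum(row[-1] for row in image)]

schedule = [learning_rate(e) for e in range(1, 26)]  # 25 training epochs
feature = flip_averaged_feature(toy_extract, [[1, 2, 3], [4, 5, 6]])
print(schedule[0], schedule[15], schedule[20], feature)
```

Averaging over the flipped copy makes the final feature of this toy extractor independent of the direction in which the vessel is heading, which is the motivation for using it at test time.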

Validation results for MVR-net The network is compared with the MGN and PRN re-ID architectures. Table 2 presents the results, both with and without the re-ranking (RR) technique. Here, the results are analyzed and discussed only for the models with the re-ranking technique. According to the table, the proposed MVR-net outperforms PRN and MGN by 2.9% and 8.7% mAP, respectively. The improved performance also holds for the other re-identification evaluation metrics. For example, MVR-net achieves a 4.3% and 6.6% higher Rank-1 than PRN and MGN, respectively. This performance demonstrates the effectiveness of the MVR-net design for maritime surveillance applications.

Table 2 Performance comparison of network architectures and our MVR-net

Side experiment using triplet loss Following common approaches in literature, we have also equipped our triplet loss function with the batch-hard sampling strategy. With this strategy, the triplet block treats each of the N vessel samples of a mini-batch as an anchor sample once. In practice, this means that for each anchor we select a positive sample (i.e. another sample of the same vessel) and a negative sample (i.e. a sample from another vessel) from the same mini-batch. Specifically, the strategy finds the most similar negative sample (i.e. the sample of the mini-batch that has a different identity from the anchor but lies closest to it) and, likewise, the most dissimilar positive sample. The intuition behind this technique is to minimize the distance between samples with the same identity and to maximize the distance between samples with different identities in the feature space as much as possible. Normally, this is done by comparing the feature similarities of the mini-batch samples using metrics like the Euclidean distance.
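The standard batch-hard selection can be sketched as follows. This is a minimal, framework-free illustration using the Euclidean distance and a margin of one, as stated in the implementation details; the function name and the toy mini-batch are hypothetical, and a practical implementation would operate on tensors inside the training framework.

```python
import math

def batch_hard_triplet_loss(features, labels, margin=1.0):
    """Mean over anchors of max(0, margin + hardest_pos - hardest_neg).

    hardest_pos: largest distance from the anchor to a sample of the same identity.
    hardest_neg: smallest distance from the anchor to a sample of another identity.
    """
    losses = []
    for a, (fa, la) in enumerate(zip(features, labels)):
        pos = [math.dist(fa, f)
               for i, (f, l) in enumerate(zip(features, labels))
               if l == la and i != a]
        neg = [math.dist(fa, f) for f, l in zip(features, labels) if l != la]
        if not pos or not neg:
            continue  # this anchor has no valid triplet in the mini-batch
        losses.append(max(0.0, margin + max(pos) - min(neg)))
    return sum(losses) / len(losses) if losses else 0.0

# Toy mini-batch: two identities with two 1-D feature vectors each.
features = [[0.0], [1.0], [3.0], [5.0]]
labels = [0, 0, 1, 1]
loss = batch_hard_triplet_loss(features, labels)
print(loss)
```

In this toy mini-batch, only the third sample violates the margin: its hardest positive and hardest negative are equally far away, so it contributes a loss equal to the margin, while the other three anchors contribute zero. With \(P=5\) and \(K=4\), a real mini-batch would contain \(N=20\) anchors.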

In this work, we have attempted to construct better triplets by choosing the most dissimilar positive and most similar negative samples according to their orientation labels. The motivation is that an appearance-based differentiating metric, like the Euclidean distance, can select positive and negative samples merely because of differences in scene lighting or extra background pixels around the vessel (inside the cropped image). Moreover, we have modified the mini-batch generation block to force it to create each mini-batch from samples having the same or at least similar type labels. With this, we aimed at increasing the ability of the triplet block to choose more similar negative samples. However, after applying this type and orientation-based strategy to the triplet loss calculation, we have observed no improvement compared to the MVR-net with the standard batch-hard sampling strategy. We think this happens because more than 35k random mini-batches have been used for training the MVR-net. Therefore, the triplet block is already trained sufficiently on different combinations of input images, which explains the lack of improvement. The conclusion of this side experiment is that with batch-hard sampling and the applied triplet loss, the re-identification of the same vessel at another harbor position does not benefit from incorporating the viewing angles, since the proposed network already encounters sufficient occurrences of the same vessel in the dataset.
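One possible reading of the orientation-based selection is sketched below. The text does not specify how the dissimilarity between two orientation labels is measured, so the sketch assumes the number of steps between labels in the ordered list of the five views. This metric, the function names and the toy mini-batch are our assumptions and not necessarily what was used in the experiment.

```python
ORIENTATIONS = ["front", "front-side", "side", "back-side", "back"]

def orientation_gap(a, b):
    """Steps between two labels in the ordered list above (assumed metric)."""
    return abs(ORIENTATIONS.index(a) - ORIENTATIONS.index(b))

def select_triplet(anchor, ids, orientations):
    """Pick positive and negative indices for `anchor` from orientation labels.

    Positive: same identity, largest orientation gap to the anchor.
    Negative: different identity, smallest orientation gap to the anchor.
    Returns None if the mini-batch offers no valid positive or negative.
    """
    pos = [i for i in range(len(ids)) if ids[i] == ids[anchor] and i != anchor]
    neg = [i for i in range(len(ids)) if ids[i] != ids[anchor]]
    if not pos or not neg:
        return None

    def gap(i):
        return orientation_gap(orientations[anchor], orientations[i])

    return max(pos, key=gap), min(neg, key=gap)

# Toy mini-batch: three samples of vessel 1 and two samples of vessel 2.
ids = [1, 1, 1, 2, 2]
views = ["front", "front-side", "back", "side", "front"]
triplet = select_triplet(0, ids, views)
print(triplet)
```

For the front-view anchor, this selection pairs it with the back view of the same vessel and with the front view of the other vessel, regardless of how close these samples are in feature space.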

Discussion on the MVR-net design As mentioned in Sect. 3.3, we have designed and empirically evaluated several network architectures to find the best-performing architecture for the vessel re-identification problem (i.e. the MVR-net). This part briefly reflects on these experiments, which cover all candidate designs that split the feature embedding into at most four partitions. Table 3 presents the obtained results. The tested architectures include two or three branches, depending on whether the channel-wise partitioning is implemented inside the spatial branches (as is the case for PRN) or in a third, independent channel branch. According to the table, the architecture of MVR-net yields the highest performance in terms of both mAP and R-1. This motivates our choice of the architecture illustrated in Fig. 1 as the re-identification network for maritime surveillance.

Table 3 Performance scores (%) in mAP and Rank-1 for candidate architectures with different branches and partitionings

6 Conclusions

In this paper, we have introduced two main contributions for addressing the vessel re-identification problem. First, we have captured, annotated, and hereby publish a novel vessel re-identification dataset, referred to as VR-VCA. This dataset contains 4,614 vessel samples from 729 unique vessel identities. Additionally, we have provided a vessel type label (eight classes) and a vessel orientation label (five classes) for each dataset sample. The images of the VR-VCA dataset are captured at different locations in the Netherlands. A diverse set of weather conditions, water region types, and backgrounds is represented in VR-VCA. In our dataset, multiple vessels occur with very similar appearances (i.e. model, etc.). Additionally, vessels appear with widely varying aspect ratios and are captured at different distances and orientations to the cameras. These broad variations make VR-VCA a challenging vessel re-ID dataset.

The performance of different baseline methods is benchmarked on the described dataset. Based on this benchmarking, we have adopted ResNet50 as the backbone network for the vessel re-ID problem. In addition to the dataset, we have introduced a deep re-identification network, MVR-net, specifically designed for the maritime surveillance domain. This network architecture achieves reliable re-ID performance on maritime vessels by combining their spatial and channel-wise features. For extracting a better representation of vessels in the spatial dimensions, MVR-net employs two separate height-wise and width-wise branches. Since the vessels are captured at different resolutions, the spatial branches partition the feature embedding into three different sets to detect more useful features for each resolution scenario. The proposed network outperforms PRN and MGN, two well-known re-identification networks, by 2.9% and 8.7% mAP, and by 4.3% and 6.6% Rank-1, respectively. We have validated the MVR-net design by testing several alternative candidate network designs, which shows that the adopted architecture yields the highest scores.

For future work, the implementation parameters of baseline networks and the MVR-net can be further tuned. Additionally, we aim at improving the MVR-net performance on VR-VCA, using pose and class information of vessels, and multi-resolution feature pyramids. Moreover, to address the challenge that is imposed by having different vessels with similar appearances, a model refinement focusing on detecting local features of vessel images could be explored.