1 Introduction

Visual tracking has been one of the most fundamental topics in computer vision due to its important roles in numerous applications such as surveillance, human-computer interaction, and autonomous driving [1–20]. It aims to estimate the states (e.g., location, scale, rotation) of a target in a video after the target is specified in the first frame, usually by a rectangle. While significant efforts have been made in the past decades, developing a robust tracking algorithm for complicated scenarios is still challenging due to interfering factors such as heavy occlusion, pose changes, large scale variations, camera motion, and illumination variations.

In recent years, inspired by feature learning based on sparse coding [21], hierarchical features learned by CNNs have greatly boosted the performance of visual tracking methods [22–27]. To learn discriminative representations, most existing methods utilize information from the image (region) category, namely target or background [24, 26, 28–30], and/or from target motion across consecutive frames [31, 32]. Choi et al. [24] propose to utilize the category information when learning target/background classification. In [28], Dong and Shen employ the distance relationship among positive samples, negative samples, and the target template to learn more discriminative features. A feature net, a temporal net, and a spatial net are designed in [31] to extract general feature representations, encode the target trajectory, and refine tracking results using local spatial object information, respectively.

Although these methods have proven effective, they ignore the importance of the ranking relationship among samples, which indicates whether one positive sample is better than another. Different from the image classification task, visual tracking is location-sensitive: a good visual tracking CNN model should not only tell positive samples from negative ones but also distinguish the quality of positive samples and hence select the one having the largest overlap ratio with the ground truth. As shown in Fig. 1, the left panel shows the behavior of most existing classification scores, which can correctly distinguish positive samples from negative ones. However, the order among positive samples cannot be guaranteed. As a result, the best target candidate may not obtain the highest classification score and hence cannot be selected as the tracking result.

Fig. 1

Most existing tracking methods only require telling positive samples from negative ones, as shown by the left panel. However, this is not enough for the visual tracking task, as it needs to select the candidate having the largest IoU with the ground truth. To overcome this problem, we propose to predict scores that are consistent with the ranking order in terms of IoU with the ground truth, as shown by the right panel

To address the aforementioned problem, in this paper, we propose to take advantage of the ranking relationship among positive samples to learn more discriminative features. We require that the order of confidence scores be consistent with the order of the samples' IoU with the ground truth (IoU is the intersection over union of two bounding boxes). With such a constraint, the model is not only able to tell positive samples from negative ones, but also able to assign the highest confidence to the candidate that is most similar to the target template. The right panel of Fig. 1 illustrates the expected scoring scheme. In addition, we observe that spatially close samples generally share the same CNN features due to the resolution reduction caused by convolution and pooling operations. To overcome this problem, we also propose to use normalized spatial location information to enhance the difference between spatially neighboring candidates. Figure 2 shows that the proposed approach achieves better performance than several state-of-the-art tracking methods.

Fig. 2

Comparison of the proposed approach against the state-of-the-art ADNet tracker on three example sequences challenged by large occlusion (top row), illumination variation (middle row), and fast motion (bottom row). Our approach performs more robustly than ADNet

In summary, we make the following contributions in this paper:

  1. We propose a tracking method that takes the ranking relationship among samples into consideration and estimates sample scores whose ranking is consistent with the samples' IoU with the ground truth.

  2. We propose to take advantage of the location information of samples to distinguish them from each other even when they are closely positioned.

  3. Extensive experimental results on large object tracking datasets show the effectiveness of the proposed tracking method in comparison with several state-of-the-art tracking methods.

2 Related work

In this section, we briefly review the closely related tracking methods.

Generally, most existing tracking methods fall into either the non-CNN-based category or the CNN-based category according to whether CNN features are used. Non-CNN-based tracking methods usually employ the sparse coding framework to obtain effective image representations [17, 33–38]. In [17], the spatial structure among selected local templates is enhanced to exclude distractors introduced by noisy templates. Lan et al. [33, 35, 36] propose to mine the common and specific patterns in sparse codes so as to discriminate positive and negative examples.

Wang et al. [39] first introduce deep learning into the visual tracking task, where a denoising auto-encoder is employed to learn compact image representations in a self-supervised manner. After that, Hong et al. propose to use the gradient back-propagation algorithm to generate a saliency map for the tracking target in order to facilitate target localization. These two methods only utilize CNN features extracted from one of the last fully connected (FC) layers. Deeper features are rich in semantic information, which helps the tracker distinguish the target from the background; however, visual tracking is a location-sensitive task, and deeper features cannot provide spatial details because of their low spatial resolution (1 × 1 for FC features). To overcome this problem, [40] proposes to combine CNN features extracted from both shallow and deep layers, which is shown to be effective on visual tracking benchmarks. One drawback of [40] is that the combination weights for different features are fixed for all frames and all videos. This is suboptimal because different features perform best in different scenarios. To overcome this problem, Qi et al. [22] propose to adaptively generate combination weights via an improved Hedge algorithm. The aforementioned methods use CNN models pretrained for the image classification task. Due to the fundamental difference between the two tasks, directly adopting or simply fine-tuning image classification models limits the performance of CNNs. To better adapt to the visual tracking task, a multi-domain CNN is designed in [41] to avoid the category ambiguity that one class is the tracking target in one video while being background in another. Nam and Han [41] also introduce hard negative sample mining and bounding box refinement to further improve tracking performance. Very recently, [42] proposes to enhance tracking results via pixel-wise object segmentation. The advantage of segmentation-based methods is that a rotated minimum bounding rectangle, rather than an axis-aligned box, can be estimated.

The other line of work is to develop real-time CNN-based tracking methods. Tao et al. [43] propose the first real-time CNN-based tracking method. They design a Siamese network to learn a similarity function and use ROI pooling to reduce repeated feature computation, at the price of sacrificing tracking accuracy. Later, Bertinetto et al. [44, 45] propose to implement correlation filter learning within end-to-end CNN training, which achieves a balance between tracking accuracy and tracking speed. In [46], Yun et al. propose to determine the target state in a new frame by moving the previous tracking result in the left/right/up/down directions and zooming the bounding box in or out until a stop action is generated. This method avoids selecting the tracking target from hundreds of target candidates and hence improves the tracking speed. Very recently, [25] proposes to quickly adapt pretrained CNN models to test image sequences via meta-learning, which usually accomplishes adaptation to new videos within five iterations. Li et al. [47] propose to integrate a region proposal network into the Siamese network to address the scale problem of Siamese trackers.

Overall, most existing CNN-based visual tracking methods only make use of image region category information and/or target motion information. They neglect the ranking relationship among samples, and hence, the positive target candidate with the highest confidence may not be the best one. To overcome this problem, we propose a tracking method that aligns the confidence ranking with the ranking of samples' IoU with the ground truth.

3 Method

In this section, we first detail the proposed neural network and then we describe how to train the network.

3.1 Architecture

The proposed deep convolutional neural network is equipped with two branches in a Siamese architecture as shown in Fig. 3. In each branch, the first three layers are used to learn representations common to all kinds of objects, such as corner points and edges. They can be implemented using pretrained CNN models originally designed for image classification, such as AlexNet [48], VGG [49], and ResNet [50]. Here, we adopt the first three layers of VGG-M [51] due to its balance between computational cost and classification accuracy. The next three fully connected layers are used to learn a high-level embedding of the input image; they are initialized randomly from a Gaussian distribution. Before classification, we concatenate the image embedding with its spatial information (xi, yi, w/W, h/H), where (xi, yi) denotes the coordinates of the top-left point of the input image region, w and h denote the width and height of the image region, and W and H denote the width and height of the video frame.
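To make this description concrete, below is a minimal PyTorch sketch of such a two-branch network. The exact layer sizes, the adaptive pooling before the fully connected layers, and the embedding dimensions are our assumptions for illustration only; the actual model uses the first three convolutional layers of VGG-M with randomly initialized FC layers.

```python
import torch
import torch.nn as nn

class RankingTrackerNet(nn.Module):
    def __init__(self, feat_dim=512, embed_dim=512):
        super().__init__()
        # Shared convolutional stem (a stand-in for the first three VGG-M conv layers).
        self.conv = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, feat_dim, kernel_size=3), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((3, 3)),  # simplification so the FC input size is fixed
        )
        # Three fully connected layers learn the high-level embedding (random init).
        self.fc = nn.Sequential(
            nn.Linear(feat_dim * 3 * 3, embed_dim), nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim), nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim), nn.ReLU(inplace=True),
        )
        # The classifier sees the embedding concatenated with the 4-D spatial feature.
        self.cls = nn.Linear(embed_dim + 4, 2)

    def forward_branch(self, region, spatial):
        # region: cropped image patches (N, 3, h, w); spatial: (x, y, w/W, h/H) per sample.
        x = self.conv(region)
        x = self.fc(x.flatten(1))
        return self.cls(torch.cat([x, spatial], dim=1))

    def forward(self, region_i, spatial_i, region_j, spatial_j):
        # Siamese: both branches share the same weights.
        return (self.forward_branch(region_i, spatial_i),
                self.forward_branch(region_j, spatial_j))
```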

Fig. 3

The main architecture of the proposed neural network. In the training phase, it takes two image regions as input and outputs the target/background classification scores. A softmax loss and a margin ranking loss are employed for training. In the test phase, only one branch is retained

We employ the softmax loss as the supervision for target/background classification:

$$ l_{\text{cls}}(x_{i},y_{i}) = -f(x_{i})_{y_{i}}+\text{log}\left(\sum_{j=0,1}e^{f(x_{i})_{j}}\right) $$
(1)

where yi denotes the class label of the input image region xi, \(f(x_{i})_{y_{i}}\) denotes the yith element of the network output f(·), and j=0,1 denote the class labels: 0 for background, 1 for target. To constrain the predicted scores to be consistent with the samples' ranking in terms of IoU with the ground truth, we also adopt the margin ranking loss:

$$ l_{\text{rank}}(x_{i},x_{j}) = \text{max}\left(0, m- \left(f(x_{i})_{1}-f(x_{j})_{1}\right)\right) $$
(2)

where xi and xj denote two input image regions and m denotes the minimum margin. If the training sample xi has a larger IoU with the ground truth than xj, it should rank before xj, which means its probability of being the target, f(xi)1, should be larger than f(xj)1. The overall loss for a training pair (xi,yi,xj,yj) is

$$\begin{array}{*{20}l} L(x_{i},y_{i},x_{j},y_{j}) = & l_{\text{cls}}(x_{i},y_{i}) + l_{\text{cls}}(x_{j},y_{j})\\ &+l_{\text{rank}}(x_{i},x_{j}) \end{array} $$
(3)
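As an illustration, the combined objective of Eqs. (1)–(3) can be written compactly in PyTorch as follows. The margin value and variable names are assumptions, and the sketch presumes that xi always has the larger IoU with the ground truth within each pair.

```python
import torch.nn.functional as F

def overall_loss(scores_i, labels_i, scores_j, labels_j, margin=0.5):
    # Eq. (1): softmax (cross-entropy) loss for target/background classification,
    # applied to both members of the pair.
    l_cls = F.cross_entropy(scores_i, labels_i) + F.cross_entropy(scores_j, labels_j)
    # Eq. (2): margin ranking loss; x_i is assumed to have the larger IoU with the
    # ground truth, so its target score (column 1) should exceed that of x_j by the margin.
    l_rank = F.relu(margin - (scores_i[:, 1] - scores_j[:, 1])).mean()
    # Eq. (3): overall loss for the training pair.
    return l_cls + l_rank
```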

3.2 Training

The network is trained end-to-end using stochastic gradient descent (SGD) with momentum 0.9. The training data are sampled according to [41]: in each frame, 5500 samples are randomly extracted around the ground truth. The learning rate is fixed to 2e−4. Each mini-batch contains 32 positive and 32 negative samples, which have IoU ≥ 0.7 and ≤ 0.5 with the ground truth bounding box, respectively. The network converges after about 200 iterations.
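The sampling procedure can be sketched as follows. The IoU computation is standard; the Gaussian perturbation scale and the helper names are our assumptions rather than the exact settings of [41].

```python
import numpy as np

def iou(box_a, box_b):
    # Boxes are (x, y, w, h); returns the intersection over union.
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    yb = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union

def sample_around(box, n, trans_scale=0.3, scale_step=1.05):
    # Draw candidate boxes from a Gaussian centered at the given box.
    samples = []
    for _ in range(n):
        dx, dy = np.random.randn(2) * trans_scale * np.array(box[2:4])
        ds = scale_step ** np.random.randn()
        samples.append([box[0] + dx, box[1] + dy, box[2] * ds, box[3] * ds])
    return samples

def split_samples(samples, gt):
    # Partition samples by their IoU with the ground truth, as described above.
    pos = [s for s in samples if iou(s, gt) >= 0.7]
    neg = [s for s in samples if iou(s, gt) <= 0.5]
    return pos, neg
```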

3.3 Tracking

Let xi denote the ith target candidate; the tracking result is the one with the largest target confidence:

$$ x^{*} = \arg\max_{i=1,\cdots,N} f(x_{i})_{1} $$
(4)

where N denotes the number of target candidates. Following [41], the model is updated when the maximum target confidence falls below zero or after a fixed interval (the short-term update interval is 20 frames and the long-term update interval is 100 frames). The data for model update are sampled in each frame around the tracking result.
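A rough sketch of one tracking step is given below. It reuses forward_branch from the architecture sketch and sample_around from the training sketch, assumes a crop_and_resize helper that extracts and resizes an image region (not shown), and folds in the implementation settings reported later (300 candidates per frame, update on negative confidence or at a fixed interval). The details are illustrative, not the exact implementation.

```python
import torch

def track_frame(model, frame, prev_box, frame_idx, frame_size, n_candidates=300):
    W, H = frame_size
    # Sample candidates from a Gaussian centered at the previous tracking result.
    candidates = sample_around(prev_box, n_candidates)
    regions = torch.stack([crop_and_resize(frame, b) for b in candidates])  # assumed helper
    spatial = torch.tensor([[b[0], b[1], b[2] / W, b[3] / H] for b in candidates],
                           dtype=torch.float32)
    with torch.no_grad():
        scores = model.forward_branch(regions, spatial)[:, 1]  # target confidence, Eq. (4)
    best = int(scores.argmax())
    # Update the model when tracking looks unreliable or at the fixed interval.
    need_update = bool(scores[best] < 0) or frame_idx % 20 == 0
    return candidates[best], need_update
```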

4 Experiments

In this section, we first introduce the evaluation protocols. Then, we examine the effectiveness of the proposed tracking method on large scale datasets compared to state-of-the-art tracking methods.

4.1 Evaluation protocols

We adopt the commonly used success plots and precision plots [52] as the main evaluation metrics, which avoid the drawback of measuring success with only a single threshold. Trackers in success plots are ranked in terms of the area under the curve (AUC), and trackers in precision plots are ranked in terms of the precision at a threshold of 20 pixels between the centers of the tracked results and the ground truth.
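For reference, a simplified version of these two metrics could look like the following. The overlap-threshold grid and the (x, y, w, h) box format are assumptions, and the official OTB toolkit may differ in details; the iou helper is the one sketched in Section 3.2.

```python
import numpy as np

def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    # Success plot: fraction of frames whose overlap exceeds each threshold,
    # summarized by the area under that curve (AUC).
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return float(np.mean([(overlaps > t).mean() for t in thresholds]))

def precision_at_20(pred_boxes, gt_boxes, threshold=20.0):
    # Precision plot value at the conventional 20-pixel center-error threshold.
    def center(b):
        return np.array([b[0] + b[2] / 2.0, b[1] + b[3] / 2.0])
    errors = np.array([np.linalg.norm(center(p) - center(g))
                       for p, g in zip(pred_boxes, gt_boxes)])
    return float((errors <= threshold).mean())
```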

The implementation is based on PyTorch. We sample 300 target candidates in each frame. The model is updated every 20 frames or when the largest confidence is negative. The unoptimized code runs at 1 FPS on a machine with an i7 3.4 GHz CPU and a GeForce GTX 1080 GPU. Fine-tuning samples are collected as tracking proceeds, where 50 positive and negative samples are extracted in each tracked frame around the tracking result.

We compare the proposed method with six state-of-the-art tracking approaches including ADNet [46], CFNet [45], HDT [22], MCPF [53], CREST [54], and MetaTracker [25].

4.2 Ablation analysis

In this section, we evaluate the effectiveness of the introduced ranking loss and the spatial location features, respectively. Table 1 presents the tracking performance on the OTB100 dataset in terms of AUC and precision scores. It shows that the tracking performance drops by about 2% if the ranking loss is not employed. If the spatial location features are not used, the tracking accuracy drops by about 1% in terms of both the AUC and precision metrics. These results demonstrate the effectiveness of both the ranking loss and the spatial location features.

Table 1 AUC score and precision at a threshold of 20 pixels on the OTB100 dataset for the ablation analysis on the ranking loss (denoted by RL) and spatial location feature (denoted by SLF)

4.3 Quantitative evaluation

In Fig. 4, we provide the overall performance on the OTB100 dataset. It shows that the proposed tracking method achieves favorable performance compared to state-of-the-art trackers such as MetaTracker, CREST, and ADNet. Specifically, our tracking method performs about 4% better than CREST, which takes both appearance information and temporal motion information into consideration. In contrast, the only motion cue utilized in our method is in the target candidate sampling, which follows a Gaussian distribution centered at the previous tracking result.

Fig. 4

Overall tracking performance on the OTB100 dataset in terms of success and precision plots

To further evaluate the proposed method, we also conduct attribute-based performance evaluation on the OTB100 dataset in terms of success plots and precision plots. The results are presented in Figs. 5 and 6. The results in Fig. 5 show that our tracking approach performs best on 7 out of 11 attributes, including fast motion, deformation, illumination variation, in-plane and out-of-plane rotations, low resolution, and out-of-view. In terms of tracking precision, similar performance can be observed in Fig. 6.

Fig. 5

Tracking results based on 11 attribute challenges in terms of success plots

Fig. 6

Tracking results based on 11 attribute challenges in terms of precision plots

For completeness, we also present the tracking results on the VOT 2016 dataset [55] in Table 2. The results show that our method performs favorably with an EAO of 0.320 compared against state-of-the-art tracking methods, such as CCOT [56] and SiamFC [44].

Table 2 Tracking results on the VOT2016 dataset in terms of expected average overlap (EAO), accuracy rank (A), and robustness rank (R)

4.4 Qualitative evaluation

In Fig. 7, we present sample tracking results of the evaluated methods on both OTB100 [52] and UAVDT [57] datasets. For presentation clarity, only results of the top 7 performing trackers are shown.

Fig. 7

Several samples of tracking results on both the OTB100 dataset and the UAV dataset (from top to bottom, left to right: BlurBody, Girl2, DragonBaby, Ironman, S1701, S0602, S1301, S0103, and S1303)

Occlusion. The target in the UAV-Traffic sequence S0103 undergoes occlusions caused by trees. Only the proposed method and MDNet are able to locate the target, while other trackers such as CREST and ADNet falsely lock onto the background, as shown in frames 75 and 152. Similar performance is observed in the OTB100 video Girl2, where the target girl is gradually occluded by a man walking with a bicycle. Such success can be attributed to the powerful deep CNN features regularized by both the classification loss and the ranking loss, as well as the spatial location features.

Camera motion. The target in the BlurBody image sequence is blurred due to camera shaking. In this sequence, the proposed tracking method and CCOT are still able to precisely locate the target. In contrast, other trackers such as ADNet and MCPF include much background in their results, as shown in frame 236. The effectiveness of the proposed algorithm in this case benefits from the components evaluated in Table 1. In the UAV-Traffic image sequence S0602, the camera hovers over a crossroads, which leads to large appearance variations of the target (the blue bus). The bounding boxes show that the proposed approach tracks the target more accurately than the others during the hover, while ADNet falsely locates on the road and MCPF fails to separate the target from the background, as shown in frame 291.

Object motion. The target in the OTB100 sequence DragonBaby hits his opponent with a turn-around kick. As shown by the bounding boxes, both the proposed tracking method and MCPF are able to locate the target accurately throughout this motion, while other trackers, such as ADNet, lose the target. With reference to the ablation evaluations in Table 1, both the spatial location features and the ranking loss help capture discriminative information in such a scene.

Scale. In the UAV-Traffic sequence S1701, the size of the target bus changes drastically and the observation view changes from bird's-eye view to side view, which causes large appearance variations. In such a challenging scene, the proposed method locates the target more accurately than others such as MDNet and HDT, as shown in frames 200 and 324. The performance gain of the proposed algorithm can be mainly attributed to the ranking loss, the classification loss, and the spatial location features, which together help learn robust representations under various challenges.

Illumination. The target in Ironman moves drastically at night with large illumination changes in the background. Under such poor lighting conditions, the proposed algorithm accurately locates the target in most frames while other trackers drift far away, as shown in frames 129 and 165. Similar performance can be observed in the UAV-Traffic sequences S1301 and S1303. As evaluated in Table 1, both the ranking loss and the spatial location features contribute to the robustness in such situations.

5 Conclusion

In this paper, we propose a novel tracking method which takes advantage of the ranking relationship among positive samples to learn more discriminative features and thus distinguish closely similar target candidates. To achieve this goal, we introduce a margin ranking loss that keeps the predicted confidence order consistent with the samples' IoU with the ground truth. In addition, we propose to make use of normalized spatial location information to distinguish spatially neighboring candidates. Extensive experiments on challenging image sequences demonstrate the effectiveness of the proposed algorithm against several state-of-the-art methods.