1 Introduction

With the continuous development of computer hardware and artificial intelligence algorithms in recent years, using machines to assist or partially replace humans is the current trend in the field of information technology. As one of the basic research areas of computer vision, target tracking provides computers with important information about target motion, which helps them analyze and understand the behavior of targets in order to make decisions and take actions. Although there are many applications for target tracking, complicated scenes and diversified requirements multiply its difficulties. During tracking, the appearance of the target may change drastically due to rotation, occlusion, etc. Dramatic lighting changes, rapid target motion, camera shake, and other situations may also prevent the target from being displayed clearly in the image sequence. In practice, trade-offs must be made among the software and hardware environment, speed requirements, and the accuracy and robustness of the algorithm, which further increases the difficulty of algorithm research. These problems pose a particularly large challenge for UAV target tracking [1, 2].

In recent years, experts, scholars, and engineers have invested substantial research effort, proposed a variety of single-target tracking algorithms, and built various databases to solve these problems. Smeulders et al. [3] surveyed the 19 most representative algorithms of the past 10 years, dividing them into five classes: matching, matching with extended appearance, matching with constraints, discriminative classification, and discriminative classification with constraints. In the recent research literature [4–7], tracking methods are classified into generative models and discriminative models.

Generative tracking methods describe target appearance using generative models and search for the target regions that fit the models best. In general, the target model is represented by a subspace or a basis vector consisting of a series of templates. To better learn object appearance, Ross et al. [8] proposed incremental visual tracking (IVT), which used principal component analysis (PCA) to learn a low-dimensional subspace representation and updated the target changes online. In the same year, Han et al. [9] applied a mixed probability density estimation model to target tracking with good results. In 2010, Kwon et al. [10] proposed the visual tracking decomposition (VTD) model, which provided an efficient strategy for dividing the tracking problem into a basic observation model and a motion model. Visual tracking decomposition integrates multiple basic trackers into one robust compound tracker while interactively improving the performance of all basic trackers. Among the generative tracking methods, the most representative is tracking based on sparse representation. Mei et al. [11, 12] proposed the L1T algorithm of sparse representation with the ℓ1 norm to reduce the effect of changes in internal object factors (such as rotation and scale transformation) and external factors (such as illumination). Zhong et al. [13] extracted appearance-model features from a data-independent multi-scale image space to improve the efficiency of the algorithm. That same year, Zhang et al. found that the potential relationships between sampled particles can improve the performance of tracking algorithms, and proposed compressive tracking (CT) [14]. Luka and Matej [15] proposed a coupled-layer visual model optimization method to handle rapid and significant appearance changes. Zhou et al. [16] proposed sparse heterogeneous feature representation (SHFR) for multi-class heterogeneous domain adaptation (HDA), learning a sparse feature transformation between domains with multiple classes.

Different from generative tracking methods, discriminative tracking methods treat the target tracking process as a binary classification problem: a classifier separates the target from the background. Babenko et al. [17] used online multi-instance learning to capture positive and negative samples with uncertainty as a classification algorithm for target tracking. Kalal [18] first used unlabeled structured data and a semi-supervised learning algorithm to design an online tracking method; the tracking-learning-detection (TLD) [19] algorithm was then proposed by adding redetection after tracking failure. Subsequently, Hare proposed the Struck [20] algorithm, which used online structured output based on a support vector machine. The MIL tracker [21] integrated sample importance into efficient online learning to improve the performance of the classifier. Among discriminative tracking algorithms, those based on correlation filters (CF) stood out and developed rapidly thanks to their high speed and efficiency. Bolme et al. [22, 23] proposed the MOSSE (minimum output sum of squared error) algorithm based on correlation filters, which transformed the image from the spatial domain to the frequency domain, greatly reducing memory requirements and computational burden. João et al. [24] first introduced the cyclic matrix into correlation-filter-based visual tracking, and the linear-space tracking method was then extended to nonlinear space in [25]. Yao et al. [26] proposed the RTINet approach for joint offline training of deep representations and model adaptation in CF trackers.

However, in complex environments, discriminative tracking algorithms can perform better. This is due to the use of negative samples in the discriminative model, which helps avoid drift during tracking. Generally speaking, combining the two models achieves better results than a single model. Wang et al. [4] proposed online non-negative dictionary learning (ONNDL), which is a good combination of the generative and discriminative models. Yang et al. [27] combined dictionary learning with positive and negative label information and proposed an online discriminative dictionary learning tracking method.

In this paper, the online non-negative discriminative dictionary learning for tracking (ONDDLT) algorithm is proposed, which combines the advantages of the global dictionary learning model and the class-specific dictionary learning model. The contributions of this paper are summarized as follows:

  • To slow the growth of residuals in the objective function and improve robustness to outlying (singular) values, the ℓ2 reconstruction loss is replaced by the Huber loss function.

  • The Fisher weight coefficient is used to replace the support-vector-based adaptive weight coefficients in the discriminative term, making the objective function easier to solve.

  • Non-negative constraints on the dictionary are added to enhance interpretability and system performance.

  • Experimental results on the tracking benchmarks show that our tracker achieves the best tracking performance among the sparse-coding-based methods compared in this paper.

2 Related work

2.1 The appearance representation in tracking

How to represent the target using the appearance of the target object and its features is a difficult problem in visual tracking. In current tracking research, different ideas and methods have been proposed to solve the problem of object appearance representation. In [4, 11, 12, 14, 15, 27, 28], the target object image was used as the dictionary atom after feature extraction, and new target images were used to update the dictionary during tracking, so as to reconstruct the appearance model of the target in different periods of time. In [5, 29, 30], the target was divided into multiple parts in order to cope well with pose changes and occlusion of the target object. Feature extraction is also one of the important ways to represent the object. In [30], color histogram statistics in color space were used as the representation characteristics of the target. In [31], several common feature extraction methods were combined to form new features by exploiting complementary information between features. In [6, 7, 32–35], deep learning (DL) was used as the extraction method of tracking algorithms and achieved great success. Wang et al. [32] proposed point-to-set distance metric learning conducted on convolutional neural network features of the training data extracted from the starting frames. Qu et al. [36] integrated the fast histogram of oriented gradients (FHOG) and discriminative color descriptors (DD) to further boost tracking performance.

2.2 Discriminative dictionary learning

The goal of discriminative dictionary learning is to enhance the discriminative ability of the coefficient vector while learning the dictionary. There are two learning strategies: the global dictionary learning model and the independent dictionary learning model. The global dictionary learning model learns a dictionary whose atoms correspond to all categories of the training set. Mairal et al. [37] explored the structured information of the dictionary through a classifier trained on the coefficient vectors, thereby improving the recognition and classification ability of the discriminative dictionary. Pham et al. [38] proposed jointly optimized K-SVD discriminative dictionary learning for face recognition. In [39], a linear SVM (support vector machine) was used to simultaneously optimize the dictionary and classifier, making the dictionary and coefficient vectors more adaptive and flexible. These global dictionary learning methods can represent the training data with a small dictionary, but they ignore the relationship between the category labels and the dictionary atoms. The independent dictionary learning model means that each class corresponds to a separate dictionary and each dictionary atom corresponds to only one class. The structured dictionary learning model proposed by Ramirez et al. [40] could improve the discriminative ability of sub-dictionaries between different categories. In [41], the authors proposed a unified joint discriminative feature learning framework in which uncontaminated features, corrupted features, and the classifier parameters of multiple visual cues are learned jointly. The work in [42] proposed jointly learning heterogeneous features and classifiers for multi-modality tracking under a discriminability-consistency constraint. In [43], informative feature templates were extracted and the modality consistency in discriminability and representation ability was exploited for modality-fusion-based appearance modeling. Yang et al. [44] explored the Fisher discriminant criterion to learn a discriminative dictionary.

2.3 Tracking algorithm based on dictionary learning

The online dictionary learning tracking method is a target tracking method based on sparse coding. Different from general sparse coding, the training samples of the target template dictionary keep increasing, and the dictionary is required to maintain a high update speed. Accordingly, the online dictionary learning algorithm can reduce the update time of the general dictionary, so as to meet the online tracking method's demand for update speed as much as possible. In L1T [11], a basis consisting of target templates and trivial (minor) templates was used to describe the target, and a linear combination of the sparse basis vectors was used to reconstruct the candidate region particles. The target templates corresponded to the appearance of the target, while the minor templates were mainly used to deal with noise and occlusion. Zhang et al. [38] proposed a novel tracking model that used a semi-supervised appearance dictionary learning method. In general, a small number of minor templates can significantly reduce the reconstruction error. In the online non-negative dictionary learning tracking method (ONNDL) [4], the Huber loss function was used instead of the minor templates, thereby reducing the computational cost. Mathematically, it is valid for decomposition results to contain negative values, but negative elements are often meaningless in practical problems; this is why the non-negative constraint is able to enhance tracking performance. The works [45–48] are all related to sparse coding, and combining sparse coding with a non-negative constraint can improve the robustness and accuracy of the model. Like sparse coding, sparse dictionary learning can also be combined with a non-negative constraint. In addition, the mapping gradient descent model was adopted as the dictionary learning method to solve the online dictionary learning problem.

3 Proposed method

3.1 The objective function of algorithm

In general discriminative dictionary learning, the training samples and their labels are all known in advance, and the dictionary can be fully learned through the corresponding training. In target tracking, each tracking result provides new training samples and labels, and the dictionary is constantly updated during this process. Accordingly, we adopt the same method as Wang [4], the mapping gradient descent method, which makes the dictionary update faster and better. The Huber loss function grows more slowly than the ℓ2 norm as the residual increases, which is conducive to robustness against outlying values. Therefore, in the objective function, the Huber loss function is used as the reconstruction term of dictionary learning in place of the ℓ2 norm. At the same time, the ℓ1,∞ norm is used as the regularization term, inspired by class correlation; it can fully exploit inter-class relations and the label information within each class. In conclusion, the objective function is shown in Eq. (1).

$$ \begin{aligned} \min_{\boldsymbol{D},\,\boldsymbol{A}} f(\boldsymbol{D},\boldsymbol{A}) = \sum_{i}\sum_{j} \ell_{\delta}(x_{ij}-\boldsymbol{d.}_{i}\boldsymbol{a.}_{j})+\gamma\sum_{c}\left\|\boldsymbol{A}_{c}\right\|_{1,\infty}+\frac{\beta}{2}\mathcal{L}(\boldsymbol{A},\boldsymbol{W})\\ \text{s.t.}\quad \boldsymbol{D}\geqslant 0,\ \boldsymbol{A}\geqslant 0,\ \boldsymbol{d.}_{k}^{\mathrm{T}}\boldsymbol{d.}_{k}\leqslant 1,\ \forall k \end{aligned} $$
(1)

where D is the dictionary template matrix, A is the expression coefficient matrix, xij is the element in row i and column j of the training sample matrix, d.i is a row vector of the dictionary template, and a.j is a column vector of the expression coefficients. \(\mathcal {L}(\boldsymbol {A},\boldsymbol {W})\) is the discriminative term, and W is the weight coefficient matrix of the discriminative term. \(\boldsymbol {A}\geqslant 0\) ensures that the coefficients are non-negative. γ and β are parameters that can be set manually to adjust the effects of the regularization term and the discriminative term. ℓδ(·) denotes the Huber loss function, whose specific form is shown in Eq. (2).

$$ \ell_{\delta}(r) = \left\{\begin{array}{ll} \frac{1}{2}r^{2} & \,\,|r| < \delta\\ \delta|r| - \frac{1}{2}\delta^{2} & \,\,\text{otherwise} \end{array}\right. $$
(2)

where δ is the parameter of the Huber loss, controlling the velocity of the gradient descent. In previous work, adaptive weight coefficients obtained from a support vector algorithm were used in the discriminative term, but combined with the Huber loss this makes the solution too complex. Following the method of Cai [39], the Fisher discriminant criterion can instead be simplified into Eq. (3). According to the Fisher discrimination criterion, a structured dictionary, whose atoms correspond to the class labels, is learned so that the reconstruction error after sparse coding can be used for pattern classification. Meanwhile, the Fisher discrimination criterion is imposed on the coding coefficients so that they have small within-class scatter but large between-class scatter.

$$ \mathcal{L}(\boldsymbol{A}) = \sum\limits_{c=1}^{C}\left(\sum\limits_{y_{i}=c,\,y_{j}=c}\left(\frac{1}{n_{c}}-\frac{1}{2n}\right)\left\|\boldsymbol{a.}_{i}-\boldsymbol{a.}_{j}\right\|_{2}^{2} - \sum\limits_{y_{i}=c,\,y_{j}\neq c}\frac{1}{2n}\left\|\boldsymbol{a.}_{i}-\boldsymbol{a.}_{j}\right\|_{2}^{2}\right) $$
(3)

In Eq. (3), yi=c means the label of sample i is class c. It can be seen from this formula that the weight coefficients within the same class and between different classes are relatively fixed, which increases the discriminating ability while reducing the computational complexity. Since the coefficient of the inter-class term in Eq. (3) is negative, the objective function cannot be proved convex. During tracking, the basis vectors of the target dictionary are constantly updated, the contextual relation among the basis vectors of the background dictionary is required to be as weak as possible, and new background templates are also used for updating. Therefore, removing the inter-class terms has little influence on the discriminative ability of the dictionary, while guaranteeing that the objective function is convex and solvable. Accordingly, the discriminative term in Eq. (3) is simplified as

$$ \mathcal{L}({\boldsymbol{A}},{\boldsymbol{W}}) = {\left\| {{{\boldsymbol{A}}^{\mathrm{T}}}{\boldsymbol{WA}}} \right\|_{1}} $$
(4)

In Eq. (4), \({\boldsymbol{W}} = \left[\begin{array}{cc}{\boldsymbol{W}_{o}} & \boldsymbol{0}\\ \boldsymbol{0} & {\boldsymbol{W}_{b}}\end{array}\right]\), where Wo is the weight coefficient matrix of the target and Wb is the weight coefficient matrix of the background; both can be calculated from Eq. (3). Here, we only give the calculation of Wo:

$$ {{\boldsymbol{W}}_{o}} = \left[ {\begin{array}{*{20}{c}} {2 + \frac{{{n_{0}}}}{n} - \frac{4}{{{n_{0}}}}}& \cdots &{\frac{1}{n} - \frac{2}{{{n_{0}}}}}\\ \vdots & \ddots & \vdots \\ {\frac{1}{n} - \frac{2}{{{n_{0}}}}}& \cdots &{2 + \frac{{{n_{0}}}}{n} - \frac{4}{{{n_{0}}}}} \end{array}} \right] $$
(5)

In Eq. (5), n0 represents the number of samples of this class, and n represents the total number of samples. In conclusion, the objective function of the online non-negative discriminative dictionary learning tracking model is written again as

$$ \begin{aligned} \min_{\boldsymbol{D},\,\boldsymbol{A}} f(\boldsymbol{D},\boldsymbol{A}) = \sum\limits_{i}\sum\limits_{j}\ell_{\delta}(x_{ij}-\boldsymbol{d.}_{i}\boldsymbol{a.}_{j}) + \gamma\sum\limits_{c}\left\|\boldsymbol{A}_{c}\right\|_{1,\infty} + \frac{\beta}{2}\left\|\boldsymbol{A}^{\mathrm{T}}\boldsymbol{W}\boldsymbol{A}\right\|_{1}\\ \text{s.t.}\quad \boldsymbol{D}\geqslant 0,\ \boldsymbol{A}\geqslant 0,\ \boldsymbol{d.}_{k}^{\mathrm{T}}\boldsymbol{d.}_{k}\leqslant 1,\ \forall k \end{aligned} $$
(6)
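To make the pieces of Eq. (6) concrete, the sketch below assembles the fixed Fisher weight matrix of Eq. (5) and evaluates the objective in NumPy. It is an illustrative sketch, not the authors' implementation: the sample counts, the column-per-sample layout of A, and the `class_slices` bookkeeping are our own assumptions, and the discriminative term is applied along the sample axis to match the sample-count construction of Eq. (5).

```python
import numpy as np
from scipy.linalg import block_diag

def fisher_weight(n_c, n):
    """Per-class weight matrix of Eq. (5): n_c samples out of n in total."""
    W = np.full((n_c, n_c), 1.0 / n - 2.0 / n_c)      # off-diagonal entries
    np.fill_diagonal(W, 2.0 + n_c / n - 4.0 / n_c)    # diagonal entries
    return W

def huber(R, delta):
    """Element-wise Huber loss of Eq. (2) on a residual matrix."""
    return np.where(np.abs(R) < delta,
                    0.5 * R ** 2,
                    delta * np.abs(R) - 0.5 * delta ** 2)

def objective(D, A, X, W, class_slices, gamma=0.01, beta=0.005, delta=0.01):
    """Value of Eq. (6); columns of A are the per-sample coefficients a._j."""
    rec = huber(X - D @ A, delta).sum()               # Huber reconstruction term
    # l_{1,inf} norm of each class block: sum over rows of the max magnitude
    reg = sum(np.abs(A[:, s]).max(axis=1).sum() for s in class_slices)
    disc = np.abs(A @ W @ A.T).sum()                  # discriminative term, Eq. (4)
    return rec + gamma * reg + 0.5 * beta * disc

n_o, n_b = 10, 20                                     # assumed sample counts
W = block_diag(fisher_weight(n_o, n_o + n_b), fisher_weight(n_b, n_o + n_b))
```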

3.2 The solution of the expression coefficient A

After obtaining the dictionary template D, two kinds of expression coefficients need to be solved. First, the expression coefficients of the candidate particles must be solved in order to find the target through the corresponding generative or discriminant function. Second, the class-wise sparse coefficients must be solved when the template dictionary is updated online. In this section, we mainly introduce the solution of the class-wise sparse coefficients. The objective function in Eq. (6) is a convex function with the constraint \((\boldsymbol A \geqslant 0)\). Using the Lagrange multiplier method, the constraint is written into the objective function as shown in Eq. (7).

$$ \left\langle\boldsymbol{A}\right\rangle = \min_{\boldsymbol{A}} \sum\limits_{i}\sum\limits_{j}\ell_{\delta}(x_{ij}-\boldsymbol{d.}_{i}\boldsymbol{a.}_{j}) + \gamma\sum\limits_{c}\left\|\boldsymbol{A}_{c}\right\|_{1,\infty} + tr(\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{A}) + \frac{\beta}{2}\left\|\boldsymbol{A}^{\mathrm{T}}\boldsymbol{W}\boldsymbol{A}\right\|_{1} $$
(7)

Here, tr(·) is the matrix trace and Φ is the Lagrange multiplier. Due to the ℓ1,∞ norm, the above equation is a non-smooth convex function and has no closed-form solution. To handle the ℓ1,∞ norm, the other parts must be smooth convex functions. We therefore introduce the separation variable A′ and divide the solution into two unknown approximate functions. The objective function with respect to A is then rewritten as Eq. (8):

$$ \left\langle\boldsymbol{A},\boldsymbol{A}'\right\rangle = \min_{\boldsymbol{A},\,\boldsymbol{A}'} \sum\limits_{i}\sum\limits_{j}\ell_{\delta}(x_{ij}-\boldsymbol{d.}_{i}\boldsymbol{a.}_{j}) + \gamma\sum\limits_{c}\left\|\boldsymbol{A}'_{c}\right\|_{1,\infty} + tr(\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{A}) + \frac{\alpha}{2}\left\|\boldsymbol{A}-\boldsymbol{A}'\right\|_{2}^{2} + \frac{\beta}{2}\left\|\boldsymbol{A}^{\mathrm{T}}\boldsymbol{W}\boldsymbol{A}\right\|_{1} $$
(8)

Since there are two unknown variables in the objective function and they cannot be solved simultaneously, an approximate solution is obtained by alternating iterations in the spirit of ADMM, so the solution is again divided into two steps.

A) For the sub-problem in A, we can rewrite it as

$$ \left\langle\boldsymbol{A}\right\rangle = \min_{\boldsymbol{A}} \sum\limits_{i}\sum\limits_{j}\ell_{\delta}(x_{ij}-\boldsymbol{d.}_{i}\boldsymbol{a.}_{j}) + tr(\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{A}) + \frac{\alpha}{2}\left\|\boldsymbol{A}-\boldsymbol{A}'\right\|_{2}^{2} + \frac{\beta}{2}\left\|\boldsymbol{A}^{\mathrm{T}}\boldsymbol{W}\boldsymbol{A}\right\|_{1} $$
(9)

The above equation has no closed-form solution, so the following update rule, which satisfies the KKT conditions, is used to iterate the expression coefficient A until convergence.

$$ a_{kj}^{p+1} = a_{kj}^{p}\frac{\left[(\boldsymbol{Z}^{p}\odot\boldsymbol{X})^{\mathrm{T}}\boldsymbol{D}\right]_{kj}}{\left[\left(\boldsymbol{Z}^{p}\odot(\boldsymbol{D}(\boldsymbol{A}^{p})^{\mathrm{T}})\right)^{\mathrm{T}}\boldsymbol{D}+\frac{\beta}{2}\boldsymbol{A}^{p}\boldsymbol{W}\right]_{kj}+\alpha\left(a_{kj}^{p}-a{'}_{kj}^{p+1}\right)} $$
(10)

In Eq. (10), p denotes the pth iteration, ⊙ denotes the element-wise product between matrices, X is the training sample matrix, and the element zij of matrix Z is the weight coefficient of the jth feature of particle i; Z is obtained from Eq. (11).

$$ z_{ij}^{p} = \left\{\begin{array}{ll} 1 & \left|r_{ij}^{p}\right| < \delta\\ \frac{\delta}{\left|r_{ij}^{p}\right|} & \text{otherwise} \end{array}\right. $$
(11)

where \(r_{ij} = x_{ij} - \boldsymbol{d.}_{i}\boldsymbol{a.}_{j}\) is the reconstruction residual.
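A compact sketch of the update of Eqs. (10) and (11): because the rule is multiplicative, non-negative coefficients stay non-negative. Here A is stored with one column per sample (the transposed but equivalent arrangement of the index form above); the iteration count, the small eps guard, and the clipping of the denominator are our own choices.

```python
import numpy as np

def update_A(D, A, A_prime, X, W, alpha=0.05, beta=0.005, delta=0.01,
             n_iter=50, eps=1e-12):
    for _ in range(n_iter):
        R = X - D @ A                                    # residuals r_ij
        Z = np.where(np.abs(R) < delta, 1.0,
                     delta / (np.abs(R) + eps))          # Huber weights, Eq. (11)
        num = D.T @ (Z * X)                              # numerator of Eq. (10)
        den = D.T @ (Z * (D @ A)) + 0.5 * beta * (A @ W) \
              + alpha * (A - A_prime)                    # denominator of Eq. (10)
        A = A * num / np.maximum(den, eps)               # multiplicative update
    return A
```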

B) For the corresponding sub-problem in A′, the objective function is as follows:

$$ \left\langle\boldsymbol{A}'\right\rangle = \min_{\boldsymbol{A}'} \sum\limits_{c}\left(\gamma\left\|\boldsymbol{A}'_{c}\right\|_{1,\infty} + \frac{\alpha}{2}\left\|\boldsymbol{A}_{c}-\boldsymbol{A}'_{c}\right\|_{2}^{2}\right) $$
(12)
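Sub-problem (12) decouples over the rows of each class block: it is the proximal operator of the ℓ∞ norm scaled by γ/α, which by Moreau decomposition reduces to an ℓ1-ball projection (below, the classic sorting-based projection). This is a sketch under the assumption that ‖·‖1,∞ sums the per-row maxima; it is not taken from the paper.

```python
import numpy as np

def proj_l1_ball(v, z):
    """Euclidean projection of v onto the l1 ball of radius z."""
    if np.abs(v).sum() <= z:
        return v
    u = np.sort(np.abs(v))[::-1]                  # sorted magnitudes, descending
    css = np.cumsum(u)
    ind = np.arange(1, len(u) + 1)
    rho = ind[u - (css - z) / ind > 0][-1]        # largest index kept active
    theta = (css[rho - 1] - z) / rho              # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def solve_A_prime(A_c, gamma=0.01, alpha=0.05):
    """Row-wise prox of (gamma/alpha)*||.||_inf for one class block A_c,
    via prox_f(v) = v - proj_l1_ball(v, gamma/alpha)."""
    lam = gamma / alpha
    return np.stack([row - proj_l1_ball(row, lam) for row in A_c])
```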

3.3 Dictionary template update

During target tracking, in order to capture changes of target appearance in time, new target samples are constantly used to update the appearance model. This section mainly realizes the dictionary template update of the target object. Assume that at frame l the algorithm has obtained the position and size of the target; the target in this frame is then taken as a new training sample and the corresponding dictionary is updated. This differs from batch dictionary learning and is similar to the online dictionary learning algorithm of Mairal et al. [49] and the online non-negative dictionary learning tracking algorithm of Wang et al. [4]. Here, we adopt the dictionary updating method of Wang et al. [4].

Generally speaking, the probability of a drastic change during tracking is very small, so the target is very similar between every two consecutive frames. Therefore, the training samples can be approximately divided into low-rank components and sparse components, where the sparse components represent occlusion or other changes. In this way, the dictionary can automatically reduce the effect of occlusion when it is updated. The algorithm still uses Eq. (6) as the objective function, whose optimization is divided into two parts: the expression coefficients A and the dictionary template D. The solution for the expression coefficients A was given above. As for the optimization of the dictionary template D, although it can be updated incrementally with a limited batch size, it would need to be completely recalculated whenever new images are input. Here, the mapping gradient descent method is used to solve this optimization problem as shown below:

$$ \widetilde{\boldsymbol{d.}}_{k}^{t} = \boldsymbol{d.}_{k}^{t} - \eta\nabla h(\boldsymbol{d.}_{k}^{t}),\qquad \boldsymbol{d.}_{k}^{t+1} = \prod\left(\widetilde{\boldsymbol{d.}}_{k}^{t}\right) $$
(13)

In Eq. (13), \(\nabla h({\boldsymbol {d.}}_{k}^{t})\) is the gradient vector and η is the update step size. The gradient vector ∇h(d.i;A) corresponding to each dictionary atom d.i is as follows:

$$ \nabla h(\boldsymbol{d.}_{i}) = \frac{\partial h(\boldsymbol{d.}_{i};\boldsymbol{A})}{\partial\boldsymbol{d.}_{i}} = \boldsymbol{A}^{\mathrm{T}}\boldsymbol{\Lambda}_{i}\boldsymbol{A}\boldsymbol{d.}_{i} - \boldsymbol{A}^{\mathrm{T}}\boldsymbol{\Lambda}_{i}y_{i} $$
(14)

where Λi denotes the diagonal matrix whose diagonal elements are the ith row of W^t. \(\prod (\boldsymbol x)\) is a projection that maps each column of D onto the convex set \({\mathcal {C}} = \{ {\boldsymbol {x}}:{\boldsymbol {x}} \ge 0,{{\boldsymbol {x}}^{\mathrm {T}}}{\boldsymbol {x}} \le 1\} \). Besides ensuring that the dictionary atoms are non-negative, this also avoids the scale ambiguity of the atoms. Inspired by the online matrix decomposition and dictionary learning of Mairal et al. [49], the two matrices appearing in this gradient are taken as sufficient statistics of the samples so that the algorithm can be updated online. When updating at frame l, the matrices are defined as

$$ \begin{aligned} \boldsymbol{U}_{i}^{l} &= (\boldsymbol{A}^{l})^{\mathrm{T}}\boldsymbol{\Lambda}_{i}\boldsymbol{A}^{l}\\ \boldsymbol{V}_{i}^{l} &= (\boldsymbol{A}^{l})^{\mathrm{T}}\boldsymbol{\Lambda}_{i}y_{i} \end{aligned} $$
(15)

After obtaining the result of frame l+1, the update rules of matrix Ui and Vi are

$$ \begin{aligned} \boldsymbol{U}_{i}^{l+1} &= \rho\boldsymbol{U}_{i}^{l} + \boldsymbol{a.}_{l+1}\boldsymbol{a.}_{l+1}^{\mathrm{T}}\\ \boldsymbol{V}_{i}^{l+1} &= \rho\boldsymbol{V}_{i}^{l} + \boldsymbol{a.}_{l+1}y_{i} \end{aligned} $$
(16)

In this formula, ρ is the forgetting factor, which exponentially decays the contribution of previous data. In summary, the atomic update rule of the dictionary template is as follows:

$$ \widetilde{\boldsymbol{d.}}_{i}^{l} = \boldsymbol{d.}_{i}^{l} - \eta\left(\boldsymbol{U}_{i}^{l+1}\boldsymbol{d.}_{i}^{l} - \boldsymbol{V}_{i}^{l+1}\right),\qquad \boldsymbol{d.}_{i}^{l+1} = \prod\left(\widetilde{\boldsymbol{d.}}_{i}^{l}\right) $$
(17)
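The whole dictionary update of Eqs. (15)–(17), including the projection Π onto C = {x : x ≥ 0, xᵀx ≤ 1} (clip negatives to zero, then rescale any column whose norm exceeds one), can be sketched as follows. The per-row statistics U and V are kept as lists, and all variable names are ours, not the paper's.

```python
import numpy as np

def update_dictionary(D, U, V, a_new, y_new, rho=0.99, eta=0.2):
    """One online update after a new frame: a_new is the coefficient vector
    of the newly tracked target, y_new its feature vector; U[i], V[i] are
    the sufficient statistics of Eq. (15) for row d._i of D."""
    for i in range(D.shape[0]):                            # each row d._i of D
        U[i] = rho * U[i] + np.outer(a_new, a_new)         # Eq. (16)
        V[i] = rho * V[i] + a_new * y_new[i]
        D[i, :] -= eta * (U[i] @ D[i, :] - V[i])           # gradient step, Eq. (17)
    D = np.maximum(D, 0.0)                                 # projection onto C:
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)        # rescale long columns
    return D, U, V
```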

3.4 Target positioning module

When locating the target, feature extraction and selection from the target image are needed first. Generally, a rectangular bounding box is used to represent the size and position of the target. However, the target is not always rectangular, so even correct target image blocks, or the real target location and size (ground truth), inevitably contain a small amount of background. In addition, deformation or occlusion of the target has adverse effects on tracking. These effects can be reduced if invariant and informative features of the target are selected. For this reason, feature selection by logistic regression with ℓ1-norm regularization is adopted in this section, as shown below:

$$ \min_{\boldsymbol{w}} \sum\limits_{i}\log\left\{1+\exp\left[-l_{i}(\boldsymbol{w}^{\mathrm{T}}\boldsymbol{y}_{i}+b)\right]\right\} + \xi\left\|\boldsymbol{w}\right\|_{1} $$
(18)

where yi is a sample from the previous frame and li is the corresponding label: a value of 1 indicates that yi is a positive sample, and −1 indicates a negative sample. Feature selection reduces the computational complexity of the algorithm while providing robust, discriminative samples. A series of tracking studies has shown that a detailed grid search is unsuitable for most algorithms, because high similarity between samples leads to redundancy, and the computational complexity grows quadratically with the target image size. Therefore, we use a particle filter based on the sequential Monte Carlo (SMC) model to select samples. The particle filter is a sample selection method used frequently in visual tracking due to its simplicity and high efficiency. It dynamically provides candidate samples for the tracking algorithm by sequentially estimating the hidden state from the observation sequence. The hidden state variables of the particle filter need not strictly follow a Gaussian or any other parametric distribution, and the approximation precision increases with the number of particles. In addition, the probability distribution over the hidden state variables makes it easier for the algorithm to recover from tracking failure.
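As a rough sketch of this pipeline, the ℓ1-regularized logistic regression of Eq. (18) can be fit with scikit-learn (its C parameter corresponds to 1/ξ), and candidate particles can be drawn by Gaussian perturbation of the previous state. The simple (x, y, scale) state and all names here are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_features(Y, labels, xi=0.01):
    """Indices of informative features via Eq. (18); labels are +1/-1."""
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / xi)
    clf.fit(Y, labels)
    return np.flatnonzero(clf.coef_.ravel())    # only non-zero weights survive

def sample_particles(state, n_particles=400, sigma=(4.0, 4.0, 0.01)):
    """SMC-style candidates: Gaussian perturbation of the last (x, y, scale)."""
    rng = np.random.default_rng()
    return state + rng.normal(0.0, sigma, size=(n_particles, len(state)))
```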

The objective function of Eq. (6) is adopted as the search criterion when choosing the best target among the candidate particles. When the target is being positioned, each particle's representation coefficient exists independently rather than belonging to a certain class or group. Therefore, the regularization term of the objective function cannot encode the coefficients of a class or group with the ℓ1,∞ norm. Adopting the ℓ1 norm instead, the regularization term of the objective function is redefined as in Eq. (19).

$$ \min_{\boldsymbol{A}} \sum\limits_{i}\sum\limits_{j}\ell_{\delta}(x_{ij}-\boldsymbol{d.}_{i}\boldsymbol{a.}_{j}) + \gamma\left\|\boldsymbol{A}\right\|_{1} + tr(\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{A}) + \frac{\beta}{2}\left\|\boldsymbol{A}^{\mathrm{T}}\boldsymbol{W}\boldsymbol{A}\right\|_{1} $$
(19)

Similar to Eq. (10), the above equation has no closed-form solution, but the expression coefficient A can be iterated until convergence using the following update rule, which satisfies the KKT conditions:

$$ a_{kj}^{p+1} = a_{kj}^{p}\frac{\left[(\boldsymbol{Z}^{p}\odot\boldsymbol{X})^{\mathrm{T}}\boldsymbol{D}\right]_{kj}}{\left[\left(\boldsymbol{Z}^{p}\odot(\boldsymbol{D}(\boldsymbol{A}^{p})^{\mathrm{T}})\right)^{\mathrm{T}}\boldsymbol{D}+\frac{\beta}{2}\boldsymbol{A}^{p}\boldsymbol{W}\right]_{kj}} $$
(20)

After obtaining the representation coefficients, the particle with the maximum reconstruction value over the target dictionary template is usually taken as the prediction, but this approach easily causes target drift. The target dictionary template and the background dictionary template are therefore combined to improve the robustness of the algorithm: the reconstruction value over the target should be as large as possible while the reconstruction value over the background should be as small as possible. Thus, \(\mu(\|\boldsymbol{D}_{o}\boldsymbol{a}_{o}\|_{1} - \|\boldsymbol{D}_{b}\boldsymbol{a}_{b}\|_{1})\) is used as the objective, where the subscripts o and b denote the target and background, respectively. The parameter μ mainly controls the sparse representation of particles and the constraint on the background representation.
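In code, the positioning rule picks the particle that maximizes target reconstruction while penalizing background reconstruction. The sketch below assumes the coefficient vectors over the target and background dictionaries have already been solved for every candidate; the value of μ is illustrative.

```python
import numpy as np

def best_particle(D_o, D_b, A_o, A_b, mu=0.1):
    """Columns of A_o / A_b hold each candidate's coefficients over the
    target (o) and background (b) dictionaries."""
    scores = mu * (np.abs(D_o @ A_o).sum(axis=0)    # ||D_o a_o||_1 per candidate
                   - np.abs(D_b @ A_b).sum(axis=0)) # ||D_b a_b||_1 per candidate
    return int(np.argmax(scores)), scores
```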

The entire procedure of our ONDDLT is summarized in Algorithm 1.

4 Experimental results and discussion

4.1 Visual tracker benchmark

In this section, our tracker is evaluated on the OTB50 [50] and OTB100 [51] datasets. The OTB50 dataset, with 50 fully annotated sequences, facilitates tracking evaluation. To increase the robustness of the evaluation, the OTB100 dataset adds 50 videos to OTB50. For further analysis, every video in the dataset is labeled with 11 attributes (illumination variation, scale variation, occlusion, deformation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, out-of-view, background clutter, low resolution). We use the one-pass evaluation (OPE) with success plots, which count the number of successful frames whose overlap is larger than a given threshold; the success plot shows the ratio of successful frames as the threshold varies from 0 to 1. To verify the robustness of the tracker, we also evaluate it on the VOT2016 and UAV123 datasets. The VOT2016 dataset contains 60 sequences and the UAV123 dataset contains 123 sequences. The VOT2016 benchmark introduced the expected average overlap (EAO) to measure the expected no-reset overlap of a tracker. The videos in the UAV123 dataset were captured from low-altitude UAVs, and evaluation on it better measures the performance of the tracker in different scenarios.
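For reference, the OPE success curve and its AUC can be computed from per-frame overlaps as below; the threshold grid follows the standard OTB protocol, and the helper names are ours.

```python
import numpy as np

def success_curve(ious, thresholds=np.linspace(0.0, 1.0, 101)):
    """Ratio of frames whose IoU overlap exceeds each threshold; the mean
    over thresholds approximates the AUC score reported for OPE."""
    ious = np.asarray(ious, dtype=float)
    curve = np.array([(ious > t).mean() for t in thresholds])
    return curve, curve.mean()
```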

4.2 Ablation study

For an in-depth analysis of the Fisher weight coefficient and the Huber loss, we evaluate each component on the OTB100 dataset. As can be seen from Table 1, the tracking speed is also improved by the Fisher weight coefficient and the Huber loss. Removing the Fisher weight coefficient reduces the FPS from 34.6 to 18, a decrease of nearly half, and lowers the AUC score by about 0.9%. The Huber loss likewise improves both accuracy and speed. Overall, the ablation results demonstrate the effectiveness of the Fisher weight coefficient and the Huber loss for the tracking task.

Table 1 Evaluation results of ONDDLT with/without Fisher weight coefficient and Huber loss on OTB100 dataset

4.3 Quantitative analysis

To validate the proposed method, our tracker is compared with related trackers, including ONNDL [4], CT [14], MIL [21], L1T [11], and TLD [19]. In the evaluation, we set α = 0.05, δ = γ = ξ = 0.01, ρ = 0.99, η = 0.2, and β = 0.005. Table 2 compares the tracking speed of the trackers on the VOT2016 dataset; our ONDDLT achieves real-time tracking while improving the precision.

Table 2 Comparison in terms of expected average overlap (EAO), accuracy (A), and frames per second (FPS) on VOT2016

Figure 1 presents the comparison results on the OTB50 and OTB100 datasets. Compared with the related ONNDL and L1T, the performance of ONDDLT improves considerably on OTB50, achieving the best success rate of 47.4% and improving the precision by 6.0% and 5.8%, respectively. We notice that the performance of ONDDLT declines noticeably on OTB100; nevertheless, our tracker still achieves the best performance (41.8%) there, outperforming ONNDL by a gain of 2.2%. Overall, our ONDDLT performs well against the other related trackers on public visual tracking benchmarks. For comprehensive analysis, the success plots over the different annotated video attributes of the OTB100 benchmark are presented. On average, our ONDDLT performs about 3% better than ONNDL across all attributes. Our method ranks first on 8 attributes and second on 2 attributes, which can be explained by combining the advantages of the global and class-specific dictionary learning models. In particular, our tracker obtains significant improvements on LR, IV, BC, and IPR, as shown in Fig. 2; more results can be found in Fig. 5 of Appendix A. According to the comparison results in Table 2 and Fig. 3, our ONDDLT ranks first on both the VOT2016 and UAV123 datasets. The accuracy of ONNDL decreases rapidly on UAV123, whereas our ONDDLT shows good robustness. To verify the parameter robustness of the tracker, we selected three important parameters that affect tracking accuracy in Table 3. The accuracy of the tracker remains stable within a certain range of parameters, and the AUC score reaches its optimum when α = 0.05, β = 0.005, and η = 0.2.

Fig. 1

Success plots compared with the related trackers on the OTB50 and OTB100 datasets

Fig. 2

Success plots of IPR, IV, LR, and BC attributes on the OTB100 dataset

Fig. 3

Precision plots and success plots compared with the related trackers on the UAV123 dataset

Table 3 Comparison results of the AUC score (%) on OTB50 dataset with different values of parameters α,β, and η

4.4 Qualitative analysis

The qualitative results are visualized in Fig. 4. These sequences were captured under complicated environmental conditions. L1T loses the target on all videos, and ONNDL loses the target on Trellis, Kitesurf, and Deer. This shows that ONDDLT has stronger robustness and higher accuracy than the other sparse-coding-based trackers. As can be seen from the figures, our tracker predicts position and scale more accurately than the other methods. Specifically, our method shows better robustness and accuracy not only for illumination variation on Trellis and Man, but also for scale variation on Singer. It also performs well on fast motion (Deer) and deformation (Kitesurf), where other trackers lose the target. The improvement in tracking performance comes from the online non-negative discriminative dictionary learning strategy, which improves the discriminative ability for matching.

Fig. 4

Visualization of our method compared with other related trackers on Trellis, Singer, Man, Kitesurf, and Deer

5 Conclusion

In this paper, an online non-negative discriminative dictionary learning tracking algorithm is proposed, which combines the advantages of the global dictionary learning model and the class-specific dictionary learning model. To this end, we explore online dictionary learning for tracking and introduce an online discriminative dictionary learning tracking strategy. In particular, the Huber loss function and the Fisher weight coefficient are used in the discriminative term to improve computational efficiency, and non-negative constraints on the dictionary are added to enhance performance. The experimental results show that our method performs much better than the tracking methods compared in this paper. Compared with current shallow features, deep learning can explore the semantic features of the target more adaptively; therefore, the fusion of deep learning and sparse representation can be studied in the future. In addition, the computational efficiency and performance of tracking algorithms based on sparse coding can be further optimized.

6 Appendix: Evaluation results of different attributes on the OTB100 dataset

Fig. 5

Success plots of DEF, FM, OV, MB, OCC, OPR, and SV attributes on the OTB100 dataset