LSTM guided ensemble correlation filter tracking with appearance model pool

https://doi.org/10.1016/j.cviu.2020.102935

Highlights

  • We propose adaptive aggregation of CNN features from multiple layers for tracking.

  • The weights for aggregation are determined using LSTM.

  • Filters are updated using an appearance model pool to prevent faulty updates.

  • Experimental results reveal state-of-the-art performance.

Abstract

Deep learning based visual trackers have the potential to provide good performance for object tracking. Most of them use hierarchical features learned from multiple layers of a deep network. However, issues related to deterministic aggregation of these features from various layers, difficulties in estimating variations in the scale or rotation of the object being tracked, and challenges in effectively modeling the object’s appearance over long time periods leave substantial scope to improve performance. In this paper, we propose a tracker that learns correlation filters over features from multiple layers of a VGG network. A correlation filter for an individual layer is used to predict the target location. We adaptively learn the contribution of an ensemble of correlation filters to the final location estimate using an LSTM. An adaptive approach is advantageous because different layers encode diverse feature representations, and a uniform contribution would not fully exploit this complementary information. We use an LSTM because it encodes interactions across past appearances, which is useful for tracking. Further, the scale and rotation parameters are estimated using dedicated correlation filters. Additionally, an appearance model pool is used to prevent the correlation filters from drifting. Experimental results on five public datasets (the Object Tracking Benchmark (OTB100), the Visual Object Tracking (VOT) Benchmark 2016, the VOT Benchmark 2017, the Tracking Dataset and the UAV123 Dataset) reveal that our approach outperforms state-of-the-art approaches for object tracking.

Introduction

From video surveillance to human–computer interaction, visual object tracking is one of the most fundamental tasks in computer vision. Deep learning has recently been shown to significantly improve tracking performance thanks to its rich feature representations. The majority of deep learning based trackers use Convolutional Neural Networks (CNNs) pre-trained for classification, and predict the target location from features extracted by the convolutional layers. However, because classification and tracking impose different requirements, CNNs trained for classification often yield non-robust tracking. These difficulties stem from the nature of the tracking problem, where video sequences present multiple challenges such as non-stationary cameras, simultaneous target and camera motion, environmental turbulence, non-uniform backgrounds, occlusion and background clutter, often several at the same time. Hence, a robust tracker of practical utility must be able to overcome all of these challenges simultaneously.

In this paper, we propose a deep learning based tracker that predicts the target location using multi-layer convolutional feature aggregation, scale and rotation estimation, and an appearance model pool. Compared to existing trackers that use the output of a single CNN layer for target location estimation, multi-layer trackers perform better (Qi et al., 2016, Ma et al., 2015a). This is because deep layers in a CNN capture only category-level semantic information, which is best suited to classification tasks. Tracking requires a feature representation that also contains spatial information for accurate target localization. Therefore, we combine activations from deep layers with those of earlier layers (which capture more spatial information), obtaining more reliable predictions. However, the prediction accuracy of each layer may vary from frame to frame, and combining all the predictions with equal weight may result in inaccurate tracking. Thus, a reliable mechanism is required that can compute the contribution of each layer to the final location estimate. To this end, we use a Long Short Term Memory (LSTM) network that estimates the prediction accuracy of each layer and assigns a weight to it accordingly. The target’s final location is then predicted as the weighted sum of the per-layer predictions. However, in situations such as illumination variation, occlusion or appearance change, ambiguous training samples may lead to a poor correlation filter update, which can cause tracking failure over longer sequences. To overcome this, it is imperative to introduce a corrective measure that prevents faulty updates of the correlation filter. We therefore introduce an appearance model pool that stores the best appearances of the object and provides the correlation filters with the best updating template during challenging scenarios. This prevents the tracker from drifting during occlusion and other challenges, enabling successful long term tracking. Further, during the course of tracking, the object can undergo scale and rotation variations. To address these variations, we use a strategy based on FHOG features (Felzenszwalb et al., 2010) that estimates the target scale and rotation in each frame.
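To make the adaptive fusion step concrete, the sketch below shows one way an LSTM could map a short history of per-layer response statistics to fusion weights, and how the weighted response maps combine into a single location estimate. This is a minimal illustration only; the module names, input statistics and shapes are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class LayerWeightLSTM(nn.Module):
    """Hypothetical fusion-weight network: maps a short history of per-layer
    response statistics (e.g. peak value, peak-to-sidelobe ratio) to one
    weight per correlation filter. All sizes are illustrative assumptions."""
    def __init__(self, num_filters=3, stat_dim=8, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(num_filters * stat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_filters)

    def forward(self, stats_history):
        # stats_history: (batch, time, num_filters * stat_dim)
        out, _ = self.lstm(stats_history)
        # softmax keeps the per-filter weights positive and summing to one
        return torch.softmax(self.head(out[:, -1]), dim=-1)

def fuse_responses(responses, weights):
    """Weighted sum of per-layer correlation response maps; the peak of the
    fused map is the predicted target position. responses: (L, H, W)."""
    fused = (weights.view(-1, 1, 1) * responses).sum(dim=0)
    row, col = divmod(int(fused.argmax()), fused.shape[1])
    return fused, (row, col)
```

A uniform baseline would correspond to `weights = torch.full((num_filters,), 1.0 / num_filters)`; the LSTM replaces this with weights conditioned on how each filter has behaved over recent frames.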

To summarize, our major technical contributions are as follows:

  • We introduce an LSTM network that adaptively learns the weights to combine the target location predicted by each correlation filter. This helps in improving the aggregation of predictions from different layers.

  • We introduce an appearance model based correction module that avoids faulty updates to the correlation filters. This promotes efficient long term tracking (see the sketch after this list).

  • To further improve the performance, an FHOG feature (Felzenszwalb et al., 2010) based correlation filter is learned to estimate the target rotation in each frame.

  • We perform an extensive quantitative and qualitative analysis of our tracker and show competitive results on the Object Tracking Benchmark (OTB100) (Wu et al., 2015), Visual Object Tracking (VOT) - 2016 Dataset (Kristan et al., 2016d), VOT-2017 Dataset (Kristan et al., 2017b), Tracking Dataset (Vojir et al., 2014) and UAV123 Dataset (Mueller et al., 2016). We also conduct an ablation study to investigate the contribution of the various components of our tracker to the tracking performance, demonstrating the benefits of the proposed approach.
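As a rough illustration of the appearance model pool from the second contribution above, the sketch below stores high-confidence target templates and falls back to the best stored appearance when the current frame looks unreliable, so that an occluded or corrupted patch never drives the filter update. The capacity, threshold and confidence measure are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

class AppearancePool:
    """Minimal sketch of an appearance model pool: keep the most confident
    target templates seen so far and use them to guard the filter update."""
    def __init__(self, capacity=10, conf_threshold=0.4):
        self.capacity = capacity
        self.conf_threshold = conf_threshold
        self.pool = []  # list of (confidence, template) pairs

    def add(self, template, confidence):
        """Store a template only if its tracking confidence is high enough."""
        if confidence >= self.conf_threshold:
            self.pool.append((confidence, np.copy(template)))
            # retain only the 'capacity' most confident appearances
            self.pool.sort(key=lambda p: p[0], reverse=True)
            del self.pool[self.capacity:]

    def template_for_update(self, current_template, current_confidence):
        """On a low-confidence frame (occlusion, drastic appearance change),
        substitute the best stored template to avoid a faulty update."""
        if current_confidence < self.conf_threshold and self.pool:
            return self.pool[0][1]
        return current_template
```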

The rest of this paper is structured as follows. Section 2 outlines related work. Section 3 describes the proposed tracker, with each module in Fig. 1 explained in detail. Section 4 presents an evaluation of the LSTM component. Section 5 presents the experimental results obtained by testing the tracker on standard video datasets, along with a comparative analysis against other recent trackers. Section 6 presents an ablation study that quantifies the contribution of each module to the overall tracker performance, and Section 7 concludes the paper.

Section snippets

Related work

Visual Object Tracking is an active area of research in the field of computer vision. Tracking by detection is one of the most popular ways to achieve this task, where a binary classifier is learned to discriminate between the target (foreground) and the background. These classifiers are usually learned online and updated incrementally. Unfortunately, such trackers are prone to model drift due to sampling ambiguity problems during updates. Although many improvements have been suggested to overcome this issue (

Proposed approach

We propose a CNN feature based tracker that, in parallel, learns multiple correlation filters over CNN features extracted from the hierarchical convolutional layers of a VGG-Net (Simonyan and Zisserman, 2014). Each correlation filter response contributes to estimating the position of the target in a frame. In addition, to estimate the target’s scale and rotation, correlation filters are learned over FHOG features (Felzenszwalb et al., 2010). For target localization, robust and rich features are
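Although the snippet above is truncated, the building block it refers to is the standard discriminative correlation filter, which admits a closed-form ridge-regression solution in the Fourier domain. The single-channel MOSSE/DCF-style sketch below is generic background, not the authors' exact training rule; the paper learns one such filter per convolutional layer, plus FHOG-based filters for scale and rotation.

```python
import numpy as np

def gaussian_label(h, w, sigma=2.0):
    """Desired response: a Gaussian peaked at the patch centre."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))

def learn_filter(feat, label, lam=1e-4):
    """Closed-form ridge regression in the Fourier domain: the filter that
    maps the training feature patch to the Gaussian label."""
    F = np.fft.fft2(feat)
    Y = np.fft.fft2(label)
    return (Y * np.conj(F)) / (F * np.conj(F) + lam)

def respond(H, feat):
    """Response map on a new patch; the offset of its peak from the centre
    gives the predicted target translation."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(feat)))
```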

Evaluation of the LSTM component

The LSTM in Section 3.3 can be trained using either HOG features or CNN features extracted from VGG-19-Net. In order to compare the VGG features with HOG features, we perform two experiments. In the first one, we train the LSTM using VGG features and in the second we train the LSTM using HOG features. The details of the experiments are given below:
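The experiment details themselves are truncated in this snippet. As a rough illustration of the two feature inputs being compared, the sketch below extracts activations from an intermediate VGG-19 layer via torchvision, and plain HOG descriptors via scikit-image as a stand-in for the FHOG features used in the paper. The layer index, HOG parameters and helper names are assumptions for illustration.

```python
import torch
from torchvision import models, transforms
from skimage.feature import hog

# Load VGG-19 once; index 28 of the feature stack is conv5_1 in
# torchvision's layout (an illustrative choice of layer).
_vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
_prep = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def vgg_features(patch_rgb, layer_idx=28):
    """Activations of a chosen VGG-19 conv layer for a target patch
    (patch_rgb: HxWx3 uint8 array)."""
    x = _prep(patch_rgb).unsqueeze(0)
    with torch.no_grad():
        for i, layer in enumerate(_vgg):
            x = layer(x)
            if i == layer_idx:
                break
    return x.squeeze(0).numpy()

def hog_features(patch_gray):
    """Plain HOG descriptor of a grayscale patch; parameters are typical
    defaults, not the paper's FHOG configuration."""
    return hog(patch_gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))
```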

Experiments

The proposed algorithm is tested over challenging videos from the OTB100 Dataset (Wu et al., 2015), Visual Object Tracking (VOT) - 2016 Dataset (Kristan et al., 2016d), VOT-2017 Dataset (Kristan et al., 2017b), Tracking Dataset (Vojir et al., 2014) and UAV123 Dataset (Mueller et al., 2016). Of the over 70 trackers submitted to the VOT-2016 challenge, comparison is made against the top-ranked trackers (TRACA (Choi et al., 2018a), CCOT (Danelljan et al., 2016b), MLDF (Kristan et al., 2016b),

Ablation study

This section shows how each module in the proposed tracker contributes to the overall performance. We discuss the performance of seven variants of the proposed tracker. Fig. 20 shows the success and precision plots, and Table 14 shows the average intersection-over-union (IOU), average centre location error (CLE), overlap success (OS) rate and distance precision (DP) rate obtained for each variant when evaluated over the VOT-2016 dataset. Details of each variant are as follows.

  • The version UniformHOG has the location correlation filters learned over features extracted from convolutional

Conclusion

The paper proposes a deep learning based object tracking algorithm with the aim of improving performance over current deep learning based approaches. Most of these trackers use hierarchical features learned from multiple layers of a deep network and face several issues related to aggregation of the hierarchical features from various layers, difficulties in estimating variations in scale of the object being tracked, as well as challenges in effectively modeling the object’s appearance

CRediT authorship contribution statement

Monika Jain: Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Writing - review & editing, Visualization. Subramanyam A.V.: Conceptualization, Methodology, Supervision, Project administration, Funding acquisition, Resources, Writing - review & editing. Simon Denman: Conceptualization, Methodology, Supervision, Project administration, Funding acquisition, Resources, Writing - review & editing. Sridha Sridharan: Conceptualization, Methodology,

Acknowledgments

This research is supported by the Australian Research Council (ARC) Linkage Grant [Grant Number LP140100282] and Early Career Research Award, Department of Science and Technology, Government of India [Grant Number ECR/2018/002449].


References (96)

  • Adam, A., Rivlin, E., Shimshoni, I., 2006. Robust fragments-based tracking using the integral histogram. In:...
  • Babenko, B., et al., 2010. Robust object tracking with online multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell.
  • Babenko, B., et al., 2011. Robust object tracking with online multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell.
  • Bao, C., et al. Real time robust l1 tracker using accelerated proximal gradient approach.
  • Bertinetto, L., Valmadre, J., Golodetz, S., Miksik, O., Torr, P.H., 2016a. Staple: Complementary learners for real-time...
  • Bertinetto, L., et al. Fully-convolutional siamese networks for object tracking.
  • Bolme, D.S., Beveridge, J.R., Draper, B.A., Lui, Y.M., 2010. Visual object tracking using adaptive correlation filters....
  • Čehovin, L., et al., 2012. Robust visual tracking using an adaptive coupled-layer visual model. IEEE Trans. Pattern Anal. Mach. Intell.
  • Čehovin, L., et al. Robust visual tracking using template anchors.
  • Čehovin, L., et al., 2016. Visual object tracking performance measures revisited. IEEE Trans. Image Process.
  • Chen, K., et al., 2018. Convolutional regression for visual tracking. IEEE Trans. Image Process.
  • Chi, Z., et al., 2017. Dual deep network for visual tracking. IEEE Trans. Image Process.
  • Choi, J., Chang, H.J., Fischer, T., Yun, S., Lee, K., Jeong, J., Demiris, Y., Choi, J.Y., 2018a. Context-aware deep...
  • Dalal, N., et al. Histograms of oriented gradients for human detection.
  • Danelljan, M., et al., 2016. ECO: efficient convolution operators for tracking.
  • Danelljan, M., et al. Accurate scale estimation for robust visual tracking.
  • Danelljan, M., et al., 2017. Discriminative scale space tracking. IEEE Trans. Pattern Anal. Mach. Intell.
  • Danelljan, M., Hager, G., Shahbaz Khan, F., Felsberg, M., 2015a. Convolutional features for correlation filter based...
  • Danelljan, M., Hager, G., Shahbaz Khan, F., Felsberg, M., 2015b. Learning spatially regularized correlation filters for...
  • Danelljan, M., et al. Beyond correlation filters: learning continuous convolution operators for visual tracking.
  • Danelljan, M., et al. Adaptive color attributes for real-time visual tracking.
  • Fan, J., et al., 2010. Human tracking using convolutional neural networks. IEEE Trans. Neural Netw.
  • Felzenszwalb, P., et al., 2010. Object detection with discriminatively trained part based models. IEEE Trans. Pattern Anal. Mach. Intell.
  • Gan, Q., et al., 2015. First step toward model-free, anonymous object tracking with recurrent neural networks.
  • Gao, J., et al. Transfer learning based visual tracking with gaussian processes regression.
  • Girshick, R., et al. Rich feature hierarchies for accurate object detection and semantic segmentation.
  • Gordon, D., et al., 2018. Re3: real-time recurrent regression networks for visual tracking of generic objects. IEEE Robot. Autom. Lett.
  • Grabner, H., et al. Semi-supervised on-line boosting for robust tracking.
  • Hare, S., et al., 2016. Struck: structured output tracking with kernels. IEEE Trans. Pattern Anal. Mach. Intell.
  • He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE...
  • Henriques, J.F., et al. Exploiting the circulant structure of tracking-by-detection with kernels.
  • Henriques, J.F., et al., 2014. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell.
  • Henriques, J.F., et al., 2015. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell.
  • Hong, S., You, T., Kwak, S., Han, B., 2015. Online tracking by learning discriminative saliency map with convolutional...
  • Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T., 2014. Caffe:...
  • Kahou, S.E., et al. RATM: recurrent attentive tracking model.
  • Kalal, Z., et al., 2012. Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell.
  • Kiani Galoogahi, H., Fagg, A., Lucey, S., 2017. Learning background-aware correlation filters for visual tracking. In:...

Monika Jain is a Ph.D. student in the Speech, Audio, Image and Video Technology (SAIVT) Laboratory at Queensland University of Technology (QUT), Australia and Indraprastha Institute of Information Technology (IIIT), Delhi, India. She received her Bachelor of Technology from Uttar Pradesh Technical University, India and her Master of Technology with first class honors from the Institute of Engineering and Technology, Lucknow, India. Her research focuses on Visual Object Tracking.

Subramanyam A.V. is an Assistant Professor in Electronics and Communication Engineering, and Computer Science Engineering at IIIT, Delhi, India. He completed his Ph.D. at Nanyang Technological University, Singapore and his undergraduate studies at Indian School of Mines University, Dhanbad, India. His research interests lie in the areas of Multimedia Security, Information Hiding and Forensics. Presently, his work focuses on analyzing images and videos to determine their processing history for the purposes of authentication, copyright violation detection and fingerprinting.

Dr. Simon Denman received a B.Eng. (Electrical), a BIT, and a Ph.D. in the area of object tracking from QUT in Brisbane, Australia. He is currently a Senior Research Fellow with the SAIVT Laboratory at QUT. His active areas of research include intelligent surveillance, video analytics, and video-based recognition.

Professor Sridha Sridharan holds a B.Sc. (Electrical Engineering), an M.Sc. (Communication Engineering) from the University of Manchester, UK, and a Ph.D. from the University of New South Wales, Australia. He is a Professor in the School of Electrical Engineering and Computer Science at QUT, leading the research program in SAIVT with a focus on computer vision, pattern recognition and machine learning. He has published over 500 journal and refereed international conference papers and graduated 60 Ph.D. students in the areas of image and speech technologies during 1990–2016. He has received multiple research grants, including Commonwealth competitive funding, and several of his research outcomes have been commercialized.

Professor Clinton Fookes is a Professor in Vision and Signal Processing and the SAIVT group at QUT. He holds a B.Eng. (Aerospace/Avionics), an MBA (Technology Innovation/Management), and a Ph.D. in the field of computer vision. His research areas include computer vision, video surveillance, biometrics, human–computer interaction, and airport security and operations. Clinton has attracted over $15M of cash funding from external competitive sources and has published over 140 internationally peer-reviewed articles. He is currently Head of Discipline for Vision and Signal Processing, the Technical Director for the Airports of the Future collaborative research initiative, and a Senior Member of the IEEE.

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.cviu.2020.102935.
