LSTM guided ensemble correlation filter tracking with appearance model pool

https://doi.org/10.1016/j.cviu.2020.102935

Highlights

  • We propose adaptive aggregation of CNN features from multiple layers for tracking.

  • The weights for aggregation are determined using LSTM.

  • Filters are updated using an appearance model pool to prevent faulty updates.

  • Experimental results reveal state-of-the-art performance.

Abstract

Deep learning based visual trackers have the potential to provide good performance for object tracking. Most of them use hierarchical features learned from multiple layers of a deep network. However, issues related to deterministic aggregation of these features from various layers, difficulties in estimating variations in the scale or rotation of the object being tracked, and challenges in effectively modeling the object’s appearance over long time periods leave substantial scope to improve performance. In this paper, we propose a tracker that learns correlation filters over features from multiple layers of a VGG network. A correlation filter for an individual layer is used to predict the target location. We adaptively learn the contribution of an ensemble of correlation filters to the final location estimate using an LSTM. An adaptive approach is advantageous because different layers encode diverse feature representations, and a uniform contribution would not fully exploit this complementary information. We use an LSTM because it encodes interactions across past appearances, which is useful for tracking. Further, the scale and rotation parameters are estimated using dedicated correlation filters. Additionally, an appearance model pool is used to prevent the correlation filters from drifting. Experimental results on five public datasets (the Object Tracking Benchmark (OTB100), the Visual Object Tracking (VOT) Benchmark 2016, the VOT Benchmark 2017, the Tracking Dataset and the UAV123 Dataset) reveal that our approach outperforms state-of-the-art approaches for object tracking.

Introduction

From video surveillance to human–computer interaction, visual object tracking is one of the most fundamental tasks in computer vision. Deep learning has recently been shown to significantly improve tracking performance thanks to its rich feature representations. The majority of deep learning based trackers use Convolutional Neural Networks (CNNs) pre-trained for classification, and predict the target location from features extracted by the convolutional layers. However, because classification and tracking impose different requirements, CNNs trained for classification often yield non-robust tracking. These difficulties stem from the nature of the tracking problem, where video sequences present multiple challenges such as non-stationary cameras, simultaneous target and camera motion, environmental turbulence, non-uniform backgrounds, occlusion and background clutter, often several at the same time. Hence, a robust tracker of practical utility must be able to overcome all of these challenges simultaneously.

In this paper, we propose a deep learning based tracker that predicts the target location using multi-layer convolutional feature aggregation, scale and rotation estimation, and an appearance model pool. Compared to existing trackers that use the output of a single CNN layer for target location estimation, multi-layer trackers perform better (Qi et al., 2016, Ma et al., 2015a). This is because deep layers in a CNN capture only category-level semantic information, which is best suited to classification tasks. Tracking requires a feature representation that also contains spatial information for accurate target localization. Therefore, we combine activations from deep layers with those of earlier layers (which capture more spatial information), obtaining more reliable predictions. However, the prediction accuracy of each layer may vary from frame to frame, and combining all the predictions with equal weight may result in inaccurate tracking. Thus, a reliable mechanism is required that can compute the contribution of each layer to the final location estimate. To this end, we use a Long Short Term Memory (LSTM) network that estimates the prediction accuracy of each layer and assigns a weight to it accordingly. The target’s final location is then predicted as the weighted sum of the per-layer predictions. However, in situations such as illumination variation, occlusion or appearance change, ambiguous training samples may lead to a poor correlation filter update, which can cause tracking failure over longer sequences. To overcome this, it is imperative to introduce a corrective measure that prevents faulty updates of the correlation filter. We therefore introduce an appearance model pool that stores the best appearances of the object and provides the correlation filters with the best updating template during challenging scenarios. This prevents the tracker from drifting during occlusion and other challenges, enabling successful long term tracking. Further, during the course of tracking, the object can undergo scale and rotation variations. To address these variations, we use a strategy based on FHOG features (Felzenszwalb et al., 2010) that estimates the target scale and rotation in each frame.
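To make the adaptive fusion step concrete, the sketch below shows one way an LSTM could map a short history of per-layer response statistics to fusion weights, and how the weighted response maps combine into a single location estimate. This is a minimal illustration only; the module names, input statistics and shapes are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class LayerWeightLSTM(nn.Module):
    """Hypothetical fusion-weight network: maps a short history of per-layer
    response statistics (e.g. peak value, peak-to-sidelobe ratio) to one
    weight per correlation filter. All sizes are illustrative assumptions."""
    def __init__(self, num_filters=3, stat_dim=8, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(num_filters * stat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_filters)

    def forward(self, stats_history):
        # stats_history: (batch, time, num_filters * stat_dim)
        out, _ = self.lstm(stats_history)
        # softmax keeps the per-filter weights positive and summing to one
        return torch.softmax(self.head(out[:, -1]), dim=-1)

def fuse_responses(responses, weights):
    """Weighted sum of per-layer correlation response maps; the peak of the
    fused map is the predicted target position. responses: (L, H, W)."""
    fused = (weights.view(-1, 1, 1) * responses).sum(dim=0)
    row, col = divmod(int(fused.argmax()), fused.shape[1])
    return fused, (row, col)
```

A uniform baseline would correspond to `weights = torch.full((num_filters,), 1.0 / num_filters)`; the LSTM replaces this with weights conditioned on how each filter has behaved over recent frames.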

To summarize, our major technical contributions are as follows:

  • We introduce an LSTM network that adaptively learns the weights to combine the target location predicted by each correlation filter. This helps in improving the aggregation of predictions from different layers.

  • We introduce an appearance model based correction module that avoids faulty updates to the correlation filters. This promotes efficient long term tracking (see the sketch after this list).

  • To further improve the performance, an FHOG feature (Felzenszwalb et al., 2010) based correlation filter is learned to estimate the target rotation in each frame.

  • We perform an extensive quantitative and qualitative analysis of our tracker and show competitive results on the Object Tracking Benchmark (OTB100) (Wu et al., 2015), Visual Object Tracking (VOT) - 2016 Dataset (Kristan et al., 2016d), VOT-2017 Dataset (Kristan et al., 2017b), Tracking Dataset (Vojir et al., 2014) and UAV123 Dataset (Mueller et al., 2016). We also conduct an ablation study to investigate the contribution of the various components of our tracker to the tracking performance, demonstrating the benefits of the proposed approach.
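As a rough illustration of the appearance model pool from the second contribution above, the sketch below stores high-confidence target templates and falls back to the best stored appearance when the current frame looks unreliable, so that an occluded or corrupted patch never drives the filter update. The capacity, threshold and confidence measure are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

class AppearancePool:
    """Minimal sketch of an appearance model pool: keep the most confident
    target templates seen so far and use them to guard the filter update."""
    def __init__(self, capacity=10, conf_threshold=0.4):
        self.capacity = capacity
        self.conf_threshold = conf_threshold
        self.pool = []  # list of (confidence, template) pairs

    def add(self, template, confidence):
        """Store a template only if its tracking confidence is high enough."""
        if confidence >= self.conf_threshold:
            self.pool.append((confidence, np.copy(template)))
            # retain only the 'capacity' most confident appearances
            self.pool.sort(key=lambda p: p[0], reverse=True)
            del self.pool[self.capacity:]

    def template_for_update(self, current_template, current_confidence):
        """On a low-confidence frame (occlusion, drastic appearance change),
        substitute the best stored template to avoid a faulty update."""
        if current_confidence < self.conf_threshold and self.pool:
            return self.pool[0][1]
        return current_template
```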

The rest of this paper is structured as follows. Section 2 outlines related work. Section 3 describes the proposed tracker, with each module in Fig. 1 explained in detail. Section 4 presents an evaluation of the LSTM component. Section 5 presents the experimental results obtained by testing the tracker on standard video datasets, along with a comparative analysis against other recent trackers. Section 6 presents an ablation study that quantifies the contribution of each module to the overall tracker performance, and Section 7 concludes the paper.

Section snippets

Related work

Visual Object Tracking is an active area of research in the field of computer vision. Tracking by detection is one of the most popular ways to achieve this task, where a binary classifier is learned to discriminate between the target (foreground) and the background. These classifiers are usually learned online and updated incrementally. Unfortunately, such trackers are prone to model drift due to sampling ambiguity problems during updates. Although many improvements have been suggested to overcome this issue (

Proposed approach

We propose a CNN feature based tracker that, in parallel, learns multiple correlation filters over CNN features extracted from the hierarchical convolutional layers of a VGG-Net (Simonyan and Zisserman, 2014). Each correlation filter response contributes to estimating the position of the target in a frame. In addition, to estimate the target’s scale and rotation, correlation filters are learned over FHOG features (Felzenszwalb et al., 2010). For target localization, robust and rich features are
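Although the snippet above is truncated, the building block it refers to is the standard discriminative correlation filter, which admits a closed-form ridge-regression solution in the Fourier domain. The single-channel MOSSE/DCF-style sketch below is generic background, not the authors' exact training rule; the paper learns one such filter per convolutional layer, plus FHOG-based filters for scale and rotation.

```python
import numpy as np

def gaussian_label(h, w, sigma=2.0):
    """Desired response: a Gaussian peaked at the patch centre."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))

def learn_filter(feat, label, lam=1e-4):
    """Closed-form ridge regression in the Fourier domain: the filter that
    maps the training feature patch to the Gaussian label."""
    F = np.fft.fft2(feat)
    Y = np.fft.fft2(label)
    return (Y * np.conj(F)) / (F * np.conj(F) + lam)

def respond(H, feat):
    """Response map on a new patch; the offset of its peak from the centre
    gives the predicted target translation."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(feat)))
```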

Evaluation of the LSTM component

The LSTM in Section 3.3 can be trained using either HOG features or CNN features extracted from VGG-19-Net. In order to compare the VGG features with HOG features, we perform two experiments. In the first one, we train the LSTM using VGG features and in the second we train the LSTM using HOG features. The details of the experiments are given below:
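The experiment details themselves are truncated in this snippet. As a rough illustration of the two feature inputs being compared, the sketch below extracts activations from an intermediate VGG-19 layer via torchvision, and plain HOG descriptors via scikit-image as a stand-in for the FHOG features used in the paper. The layer index, HOG parameters and helper names are assumptions for illustration.

```python
import torch
from torchvision import models, transforms
from skimage.feature import hog

# Load VGG-19 once; index 28 of the feature stack is conv5_1 in
# torchvision's layout (an illustrative choice of layer).
_vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
_prep = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def vgg_features(patch_rgb, layer_idx=28):
    """Activations of a chosen VGG-19 conv layer for a target patch
    (patch_rgb: HxWx3 uint8 array)."""
    x = _prep(patch_rgb).unsqueeze(0)
    with torch.no_grad():
        for i, layer in enumerate(_vgg):
            x = layer(x)
            if i == layer_idx:
                break
    return x.squeeze(0).numpy()

def hog_features(patch_gray):
    """Plain HOG descriptor of a grayscale patch; parameters are typical
    defaults, not the paper's FHOG configuration."""
    return hog(patch_gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))
```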

Experiments

The proposed algorithm is tested over challenging videos from the OTB100 Dataset (Wu et al., 2015), Visual Object Tracking (VOT) - 2016 Dataset (Kristan et al., 2016d), VOT-2017 Dataset (Kristan et al., 2017b), Tracking Dataset (Vojir et al., 2014) and UAV123 Dataset (Mueller et al., 2016). Of the over 70 trackers submitted to the VOT-2016 challenge, comparison is made against the top-ranked trackers (TRACA (Choi et al., 2018a), CCOT (Danelljan et al., 2016b), MLDF (Kristan et al., 2016b),

Ablation study

This section shows how each module in the proposed tracker contributes to the overall performance. We discuss the performance of seven variants of the proposed tracker. Fig. 20 shows the success and precision plots, and Table 14 shows the average intersection-over-union (IOU), average centre location error (CLE), overlap success (OS) rate and distance precision (DP) rate obtained for each variant when evaluated over the VOT-2016 dataset. Details of each variant are as follows.

  • The version UniformHOG has the location correlation filters learned over features extracted from convolutional

Conclusion

The paper proposes a deep learning based object tracking algorithm with the aim of improving performance over current deep learning based approaches. Most of these trackers use hierarchical features learned from multiple layers of a deep network and face several issues related to aggregation of the hierarchical features from various layers, difficulties in estimating variations in scale of the object being tracked, as well as challenges in effectively modeling the object’s appearance

CRediT authorship contribution statement

Monika Jain: Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Writing - review & editing, Visualization. Subramanyam A.V.: Conceptualization, Methodology, Supervision, Project administration, Funding acquisition, Resources, Writing - review & editing. Simon Denman: Conceptualization, Methodology, Supervision, Project administration, Funding acquisition, Resources, Writing - review & editing. Sridha Sridharan: Conceptualization, Methodology,

Acknowledgments

This research is supported by the Australian Research Council (ARC) Linkage Grant [Grant Number LP140100282] and Early Career Research Award, Department of Science and Technology, Government of India [Grant Number ECR/2018/002449].


References (96)

  • Adam, A., Rivlin, E., Shimshoni, I., 2006. Robust fragments-based tracking using the integral histogram. In:...
  • Babenko, B., et al., 2010. Robust object tracking with online multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell.
  • Babenko, B., et al., 2011. Robust object tracking with online multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell.
  • Bao, C., et al. Real time robust l1 tracker using accelerated proximal gradient approach.
  • Bertinetto, L., Valmadre, J., Golodetz, S., Miksik, O., Torr, P.H., 2016a. Staple: Complementary learners for real-time...
  • Bertinetto, L., et al. Fully-convolutional siamese networks for object tracking.
  • Bolme, D.S., Beveridge, J.R., Draper, B.A., Lui, Y.M., 2010. Visual object tracking using adaptive correlation filters....
  • Čehovin, L., et al., 2012. Robust visual tracking using an adaptive coupled-layer visual model. IEEE Trans. Pattern Anal. Mach. Intell.
  • Čehovin, L., et al. Robust visual tracking using template anchors.
  • Čehovin, L., et al., 2016. Visual object tracking performance measures revisited. IEEE Trans. Image Process.
  • Chen, K., et al., 2018. Convolutional regression for visual tracking. IEEE Trans. Image Process.
  • Chi, Z., et al., 2017. Dual deep network for visual tracking. IEEE Trans. Image Process.
  • Choi, J., Chang, H.J., Fischer, T., Yun, S., Lee, K., Jeong, J., Demiris, Y., Choi, J.Y., 2018a. Context-aware deep...
  • Dalal, N., et al. Histograms of oriented gradients for human detection.
  • Danelljan, M., et al., 2016. ECO: efficient convolution operators for tracking.
  • Danelljan, M., et al. Accurate scale estimation for robust visual tracking.
  • Danelljan, M., et al., 2017. Discriminative scale space tracking. IEEE Trans. Pattern Anal. Mach. Intell.
  • Danelljan, M., Hager, G., Shahbaz Khan, F., Felsberg, M., 2015a. Convolutional features for correlation filter based...
  • Danelljan, M., Hager, G., Shahbaz Khan, F., Felsberg, M., 2015b. Learning spatially regularized correlation filters for...
  • Danelljan, M., et al. Beyond correlation filters: learning continuous convolution operators for visual tracking.
  • Danelljan, M., et al. Adaptive color attributes for real-time visual tracking.
  • Fan, J., et al., 2010. Human tracking using convolutional neural networks. IEEE Trans. Neural Netw.
  • Felzenszwalb, P., et al., 2010. Object detection with discriminatively trained part based models. IEEE Trans. Pattern Anal. Mach. Intell.
  • Gan, Q., et al., 2015. First step toward model-free, anonymous object tracking with recurrent neural networks.
  • Gao, J., et al. Transfer learning based visual tracking with gaussian processes regression.
  • Girshick, R., et al. Rich feature hierarchies for accurate object detection and semantic segmentation.
  • Gordon, D., et al., 2018. Re3: real-time recurrent regression networks for visual tracking of generic objects. IEEE Robot. Autom. Lett.
  • Grabner, H., et al. Semi-supervised on-line boosting for robust tracking.
  • Hare, S., et al., 2016. Struck: structured output tracking with kernels. IEEE Trans. Pattern Anal. Mach. Intell.
  • He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE...
  • Henriques, J.F., et al. Exploiting the circulant structure of tracking-by-detection with kernels.
  • Henriques, J.F., et al., 2014. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell.
  • Henriques, J.F., et al., 2015. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell.
  • Hong, S., You, T., Kwak, S., Han, B., 2015. Online tracking by learning discriminative saliency map with convolutional...
  • Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T., 2014. Caffe:...
  • Kahou, S.E., et al. RATM: recurrent attentive tracking model.
  • Kalal, Z., et al., 2012. Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell.
  • Kiani Galoogahi, H., Fagg, A., Lucey, S., 2017. Learning background-aware correlation filters for visual tracking. In:...

Monika Jain is a Ph.D. student in the Speech, Audio, Image and Video Technology (SAIVT) Laboratory at Queensland University of Technology (QUT), Australia and Indraprastha Institute of Information Technology (IIIT), Delhi, India. She received her Bachelor of Technology from Uttar Pradesh Technical University, India and her Master of Technology with first class honors from the Institute of Engineering and Technology, Lucknow, India. Her research focuses on Visual Object Tracking.

Subramanyam A.V. is an Assistant Professor in Electronics and Communication Engineering, and Computer Science Engineering at IIIT, Delhi, India. He completed his Ph.D. at Nanyang Technological University, Singapore and his undergraduate studies at Indian School of Mines University, Dhanbad, India. His research interests lie in the areas of Multimedia Security, Information Hiding and Forensics. Presently, his work focuses on analyzing images and videos to determine their processing history for the purposes of authentication, copyright violation detection and fingerprinting.

Dr. Simon Denman received a B.Eng. (Electrical), a BIT, and a Ph.D. in the area of object tracking from QUT in Brisbane, Australia. He is currently a Senior Research Fellow with the SAIVT Laboratory at QUT. His active areas of research include intelligent surveillance, video analytics, and video-based recognition.

Professor Sridha Sridharan holds a B.Sc. (Electrical Engineering), an M.Sc. (Communication Engineering) from the University of Manchester, UK, and a Ph.D. from the University of New South Wales, Australia. He is a Professor in the School of Electrical Engineering and Computer Science at QUT, leading the research program in SAIVT with a focus on computer vision, pattern recognition and machine learning. He has published over 500 journal and refereed international conference papers and graduated 60 Ph.D. students in the areas of image and speech technologies during 1990–2016. He has received multiple research grants, including Commonwealth competitive funding, and several of his research outcomes have been commercialized.

Professor Clinton Fookes is a Professor in Vision and Signal Processing and the SAIVT group at QUT. He holds a B.Eng. (Aerospace/Avionics), an MBA (Technology Innovation/Management), and a Ph.D. in the field of computer vision. His research areas include computer vision, video surveillance, biometrics, human–computer interaction, and airport security and operations. Clinton has attracted over $15M of cash funding from external competitive sources and has published over 140 internationally peer-reviewed articles. He is currently Head of Discipline for Vision and Signal Processing, the Technical Director for the Airports of the Future collaborative research initiative, and a Senior Member of the IEEE.

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.cviu.2020.102935.
