21 May 2021 Channel and spatial attention-based Siamese network for visual object tracking
Shishun Tian, Zixi Chen, Bolin Chen, Wenbin Zou, Xia Li
Abstract

Visual object tracking, which aims to automatically estimate the position of an arbitrary target in a video sequence, has drawn great attention in recent years, and many approaches have been proposed. Siamese networks, which balance accuracy and speed, have been particularly successful. A Siamese network consists of two branches: one for the target image and the other for the search image. The position with the maximum score in the similarity map between the target and search images indicates the location of the target in the search image. Current Siamese trackers treat the features of different channels and spatial locations equally, even though these features may represent different semantic information. We propose a channel and spatial (CS) attention-based Siamese network for visual object tracking, in which a CS attention mechanism is inserted into the feature extractor to enhance semantic feature learning. Experimental results show that the proposed network significantly improves the performance of the baseline tracker and ranks among the top trackers of all tested state-of-the-art trackers on the most widely used visual object tracking datasets.
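
The matching step described in the abstract (compute a similarity map between the target and search features, then take the position of the maximum score) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `similarity_map` and `locate_peak` are hypothetical, the small hand-written grids stand in for feature maps that a real tracker would obtain from a shared convolutional feature extractor, and the CS attention mechanism is not modeled here.

```python
def similarity_map(search, template):
    """Slide `template` over `search` and return the cross-correlation
    score at every valid offset (no padding, stride 1)."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    scores = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            # Sum of element-wise products over the template window.
            score = sum(
                search[i + di][j + dj] * template[di][dj]
                for di in range(th)
                for dj in range(tw)
            )
            row.append(score)
        scores.append(row)
    return scores


def locate_peak(scores):
    """Return the (row, col) of the first maximum in the score map."""
    best_pos, best_val = (0, 0), scores[0][0]
    for i, row in enumerate(scores):
        for j, val in enumerate(row):
            if val > best_val:
                best_pos, best_val = (i, j), val
    return best_pos


# Toy single-channel "features" (assumed values, for illustration only).
template_feat = [
    [1, -1],
    [-1, 1],
]
search_feat = [
    [0, 0, 0, 0],
    [0, 1, -1, 0],
    [0, -1, 1, 0],
    [0, 0, 0, 0],
]

scores = similarity_map(search_feat, template_feat)
peak = locate_peak(scores)
print(peak)  # (1, 1): the offset where the template pattern sits in the search grid
```

In this toy case the peak score is 4 at offset (1, 1), which is where the template pattern was placed. A practical tracker would apply the same idea to multi-channel learned features and map the peak back to image coordinates.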

© 2021 SPIE and IS&T 1017-9909/2021/$28.00
Shishun Tian, Zixi Chen, Bolin Chen, Wenbin Zou, and Xia Li "Channel and spatial attention-based Siamese network for visual object tracking," Journal of Electronic Imaging 30(3), 033008 (21 May 2021). https://doi.org/10.1117/1.JEI.30.3.033008
Received: 2 November 2020; Accepted: 3 May 2021; Published: 21 May 2021
CITATIONS
Cited by 4 scholarly publications.
KEYWORDS
Optical tracking

Visualization

Image classification

Cameras

Convolution

Feature extraction

Neural networks
