Local–Global Attentive Adaptation for Object Detection
Introduction
Object detection is a fundamental research field in computer vision. It aims to simultaneously locate and identify the regions of interest in an image. With the rapid development of deep learning in recent years (Schmidhuber, 2015), deep learning-based object detection has achieved significant breakthroughs and become the mainstream approach, with representative detectors including Fast RCNN (Girshick, 2015), Faster RCNN (Ren et al., 2015), RetinaNet (Lin et al., 2017), Cascade R-CNN (Cai and Vasconcelos, 2018), SSD (Liu et al., 2016), YOLO (Redmon et al., 2016) and the memory-based detector (Li et al., 2017a). Training deep networks poses a major challenge: it requires large amounts of labeled training data, whose annotation consumes considerable manpower and financial resources. Another drawback of deep networks is that they assume the training data and the test data follow the same distribution, a condition that rarely holds in practice since the source domain data and the target domain data are usually collected in different environments (refer to Fig. 1).
To this end, numerous unsupervised domain adaptation (UDA) methods have been proposed to address the challenges above. The goal of UDA is to train a model on a labeled source dataset so that it generalizes to another related but unlabeled target dataset with a different distribution. For object detection, the problem is more intractable, because the model must not only identify but also locate multiple objects in an image. Recently, many domain adaptive methods for object detection have been proposed based on the adversarial paradigm (Chen et al., 2018, Wang et al., 2019b, Zhang et al., 2019a, Xu et al., 2020). These approaches force the feature extractor to extract domain-invariant features that cannot be distinguished by a domain discriminator. For instance, the methods in Chen et al. (2018) and Xu et al. (2020) try to align the features between the source and target domains at both the global level and the instance level. Wang et al. (2019b) and Zhang et al. (2019a) use CycleGAN (Zhu et al., 2017) to force the alignment of the distributions of the two domains at the pixel and feature levels. Although these methods can reduce the distance between the two domains to a certain extent, straightforward feature matching may give rise to domain deviation problems, because not all source data are suitable for feature matching to transfer domain knowledge. Consequently, this kind of feature matching may only work well for two domains with a small gap. When the distribution discrepancy between the domains becomes large, naively matching all features and images between the domains will cause the notorious phenomenon of 'negative transfer'.
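The adversarial paradigm described above can be illustrated with a minimal sketch. It is not the implementation of any of the cited methods: it assumes a hypothetical one-dimensional logistic discriminator and a scalar feature, and it uses gradient reversal as one common way to make the feature extractor work against the discriminator. The names `discriminator_step`, `reversed_feature_gradient` and the scaling factor `lam` are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def domain_bce(p_target, domain_label):
    """Binary cross-entropy of a domain prediction.

    p_target is the predicted probability that the feature comes from the
    target domain; domain_label is 0 for source and 1 for target.
    """
    eps = 1e-12
    return -(domain_label * math.log(p_target + eps)
             + (1 - domain_label) * math.log(1.0 - p_target + eps))


def discriminator_step(feature, domain_label, w, b):
    """Loss and gradients for a toy discriminator D(f) = sigmoid(w * f + b).

    Assumption: a scalar feature and a single weight, for illustration only.
    Returns (loss, gradient w.r.t. w, gradient w.r.t. the feature).
    """
    p_target = sigmoid(w * feature + b)
    loss = domain_bce(p_target, domain_label)
    dloss_dlogit = p_target - domain_label   # derivative of BCE w.r.t. the logit
    grad_w = dloss_dlogit * feature          # used to train the discriminator
    grad_feature = dloss_dlogit * w          # flows back towards the extractor
    return loss, grad_w, grad_feature


def reversed_feature_gradient(grad_feature, lam=1.0):
    """Gradient reversal: the extractor receives the negated, scaled gradient,
    so a descent step on the extractor increases the discriminator's loss."""
    return -lam * grad_feature
```

In this sketch the discriminator descends along `grad_w`, while the feature extractor receives `reversed_feature_gradient(grad_feature)`. When every feature is treated this way regardless of how transferable it is, the result is the full feature matching that the paragraph above warns about.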
Fortunately, some recent work in domain adaptive object detection has noticed the problem of full feature matching. At the global level, Saito et al. (2019) and Xie et al. (2019) select good images for transfer. Xie et al. (2019) align domains at multiple levels. Saito et al. (2019) use the focal loss (FL) function (Lin et al., 2017) to perform weak alignment on global-level features and leverage an adversarial loss to perform strong alignment on local-level features (i.e., the regions on low-level features). However, the FL function only reduces the influence of the many easy negative samples during training and neglects images with a significant domain divergence, so the trained detection model does not perform well on such images. Besides, these methods align the multi-level features with only a naive generative adversarial strategy, which cannot effectively reduce the cross-domain gap. Unlike them, the method in Zhu et al. (2019) mines only the significant instances to transfer. However, these instances are aligned on the ROI-Pooling based features, where the instances are independent representations of region features without information interaction, and the localization accuracy of the instances cannot be guaranteed.
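The weighting behaviour of the FL function discussed above can be made concrete with a short sketch. It uses the standard focal loss form FL(p) = -(1 - p)^γ log(p) from Lin et al. (2017), applied to the output of a global domain classifier. The function names and the choice γ = 2 are assumptions for illustration and are not taken from the cited methods.

```python
import math


def focal_loss(p_true, gamma=2.0):
    """Focal loss for one sample.

    p_true is the probability the classifier assigns to the correct label.
    With gamma = 0 this reduces to the ordinary cross-entropy.
    """
    eps = 1e-12
    p_true = min(max(p_true, eps), 1.0 - eps)
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def global_domain_focal_loss(p_target, is_target, gamma=2.0):
    """Focal loss of a global domain classifier for one image.

    p_target is the predicted probability that the image comes from the
    target domain (hypothetical discriminator output).
    """
    p_true = p_target if is_target else 1.0 - p_target
    return focal_loss(p_true, gamma)


def focal_weight(p_true, gamma=2.0):
    """Modulating factor (1 - p)^gamma that scales the cross-entropy term."""
    return (1.0 - p_true) ** gamma
```

An image whose domain the classifier identifies easily (for example `p_true = 0.95`) receives a very small modulating factor, while an image near the decision boundary (`p_true = 0.6`) keeps a much larger share of its loss. This is why images with a significant domain divergence contribute little to the weak global alignment.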
From the above discussion, we find that existing domain adaptive object detection methods typically try to forcibly match the feature distributions of the two domains at a single level or at multiple levels. They do not consider a practical problem: not all local and global features have the same transferability. In addition, instance features and background features in each image contribute differently to domain transfer. To address these problems, we introduce an attention mechanism (Vaswani et al., 2017, Wang et al., 2017a) into the Faster RCNN model (Ren et al., 2015) and explore its application to domain adaptive object detection. We apply the attention mechanism mainly at the local and global levels. For local-level features, we use the local-level attention generated by multiple domain discriminators to highlight the local features with a large domain discrepancy, so that the knowledge of the two domains can be learned collaboratively and domain knowledge transfer is promoted better than in previous methods. For global-level features, we use the attention generated by the FL function (Lin et al., 2017) to focus on either the objects (i.e., instances) or the background in the whole-image feature. The global-level attention is designed on the following basis: the context vectors from the local-level and global-level features, which contain information about the whole image, are fused into the instance features to guide the subsequent detection. However, for images that are difficult to transfer, we should pay more attention to the objects (i.e., instances) in them and place less emphasis on the context that cannot be transferred, since the background of difficult samples may cause negative transfer, which was not considered in previous studies (Saito et al., 2019, Wang et al., 2019a). Thus, the attention value is incorporated into the ROI-Pooling based features to focus more on the instances than on the context. In general, our contributions can be summarized as follows.
(1) A practical framework is proposed for domain adaptive object detection. The distribution divergence of the local-level features between different domains can be reduced, and some reliable images are selected for the global-level feature alignment.
(2) Attention mechanisms are introduced for feature selection. The local attention promotes collaborative learning between the two domains, and the global attention focuses on the instances in difficult images.
(3) Extensive experiments on multiple widely-used datasets demonstrate the effectiveness of our method.
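The local-level attention mentioned in the introduction can be sketched as follows. This section does not give the exact formula, so the sketch rests on stated assumptions: a single hypothetical discriminator outputs a per-location domain probability, the attention value is one minus the normalized binary entropy of that probability (so locations where the discriminator is confident, i.e., where the domain discrepancy is large, receive high attention), and features are re-weighted residually. All function names are illustrative.

```python
import math


def binary_entropy(p):
    """Binary entropy in bits, so the value lies in [0, 1]."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p)) / math.log(2.0)


def local_attention(domain_probs):
    """Map per-location domain probabilities to attention values.

    Assumption: attention = 1 - H(d). A discriminator output near 0.5
    (domains indistinguishable at that location) gives attention near 0;
    a confident output gives attention near 1.
    """
    return [[1.0 - binary_entropy(p) for p in row] for row in domain_probs]


def apply_attention(features, attention):
    """Residual re-weighting f * (1 + a), which keeps the original feature
    and only amplifies the highlighted locations (an assumed design choice)."""
    return [[f * (1.0 + a) for f, a in zip(f_row, a_row)]
            for f_row, a_row in zip(features, attention)]
```

For a toy 1x2 map with discriminator outputs `[[0.5, 0.99]]`, the first location is left almost unchanged while the second is nearly doubled. With multiple discriminators, one attention map per discriminator could be computed in the same way, although how the maps are combined is not specified in this section.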
We detail the proposed LGAAD in Section 3. Before that, Section 2 reviews related work. Section 4 reports the experimental results, and Section 5 concludes the paper.
Section snippets
Related work
Object Detection. With the rapid development of deep learning in recent years, it has been applied in a growing range of fields. For example, Altan et al. (2019) use a long short-term memory neural network and the empirical wavelet transform to forecast digital currencies, Altan and Karasu (2020) apply deep learning to recognize COVID-19 from X-ray images, and Karasu et al. (2020) forecast the price of crude oil with high precision. While the object
Method
Our proposed method highlights the local features with larger domain discrepancy by introducing a local-level attention mechanism to significantly reduce the gap between the two domains, and utilizes the FL function (Lin et al., 2017) to focus on the transferable images in the two domains to alleviate the negative transfer caused by improper global-feature alignments. Meanwhile, we utilize context vectors to induce the subsequent detection to minimize the detection loss. and a global-level attention
Experiments
In this section, we conduct two types of experiments to verify the effectiveness of our proposed method, i.e., domain adaptation with small and with large cross-domain discrepancy. The experimental results are compared with those of other outstanding methods, which verifies the superiority of our method. Finally, we illustrate the effectiveness of our proposed method through ablation analysis, attention-map visualization, qualitative analysis, and feature visualization.
Conclusions
In this paper, we propose a domain adaptive framework for object detection based on the attention mechanism. On the one hand, the forced alignment of local features between the source and target domains reduces the domain gap without changing category-level semantic information. On the other hand, matching the transferable images between the source domain and the target domain using the FL function alleviates the negative transfer caused by improper image alignments. Meanwhile, we utilize
CRediT authorship contribution statement
Dan Zhang: Methodology, Writing - original draft, Software. Jingjing Li: Conceptualization, Software. Xingpeng Li: Data curation, Validation. Zhekai Du: Supervision, Writing - review & editing. Lin Xiong: Visualization, Validation. Mao Ye: Investigation, Resources, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the National Key R&D Program of China (2018YFE0203900), National Natural Science Foundation of China (61773093), Important Science and Technology Innovation Projects in Chengdu, PR China (2018-YF08-00039-GX), Key R&D Programs of Sichuan Science and Technology Department, PR China (2020YFG0476).
References (60)
- et al. Recognition of COVID-19 disease from X-ray images by hybrid model consisting of 2D curvelet transform, chaotic salp swarm algorithm and deep learning technique. Chaos Solitons Fractals (2020)
- et al. Digital currency forecasting with chaotic meta-heuristic bio-inspired signal processing techniques. Chaos Solitons Fractals (2019)
- et al. A new forecasting model with wrapper-based feature selection approach using multi-objective optimization technique for chaotic crude oil time series. Energy (2020)
- et al. Accurate object detection using memory-based models in surveillance scenes. Pattern Recognit. (2017)
- Deep learning in neural networks: An overview. Neural Netw. (2015)
- et al. Personalized response generation by Dual-learning based domain adaptation. Neural Netw. (2018)
- et al. Hybrid adversarial network for unsupervised domain adaptation. Inform. Sci. (2020)
- et al. Neural machine translation by jointly learning to align and translate (2014)
- et al. A unified multi-scale deep convolutional neural network for fast object detection
- Cai, Z., Vasconcelos, N., 2018. Cascade r-cnn: Delving into high quality object detection. In: Proceedings of the IEEE...
- Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
- Histograms of oriented gradients for human detection
- Imagenet: A large-scale hierarchical image database
- The pascal visual object classes (voc) challenge. Int. J. Comput. Vis.
- Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell.
- Unsupervised domain adaptation by backpropagation
- Generative adversarial nets
- Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks?
- Maximum density divergence for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell.
- Locality preserving joint transfer for domain adaptation. IEEE Trans. Image Process.
- Transfer independently together: A generalized framework for domain adaptation. IEEE Trans. Cybern.