Local–Global Attentive Adaptation for Object Detection

https://doi.org/10.1016/j.engappai.2021.104208

Abstract

Adversarial adaptive methods have been proven useful for domain transfer in many fields, such as image recognition and semantic segmentation. However, for object detection, since each image may contain a different combination of objects, forcibly aligning all images without considering their transferability can cause the notorious phenomenon known as 'negative transfer'. On the other hand, strongly matching local-level features makes sense, as it not only reduces the discrepancy between the domain distributions but also preserves category-level semantic information. However, it is hard to achieve substantial domain invariance with a simple adversarial adaptive method. In this work, we propose an effective method termed Local–Global Attentive Adaptation for object Detection (LGAAD). Our method alleviates the negative transfer caused by improper global alignment by leveraging an adaptively and dynamically weighted transferability to highlight the more transferable images. Furthermore, the proposed method achieves strong matching between the two domains at the local-feature level to alleviate the cross-domain discrepancy, using an attention mechanism built on multiple local discriminators. Additionally, we also consider the domain impact of instance-wise features and backgrounds in images with large domain divergence, a non-negligible factor for improving the performance of domain adaptive detection models. Extensive experiments on various domain-shift scenarios show that our method exceeds the state-of-the-art results on several public datasets. Furthermore, qualitative visualizations and ablation analyses demonstrate the validity of our approach in attending to the regions and instances of interest during domain adaptation.

Introduction

Object detection is a fundamental research field in computer vision. It aims to simultaneously locate and identify the regions of interest in an image. With the rapid development of deep learning in recent years (Schmidhuber, 2015), deep learning-based object detection has made significant breakthroughs and become the mainstream approach, with representatives such as Fast RCNN (Girshick, 2015), Faster RCNN (Ren et al., 2015), RetinaNet (Lin et al., 2017), Cascaded R-CNN (Cai and Vasconcelos, 2018), SSD (Liu et al., 2016), YOLO (Redmon et al., 2016) and memory-based detectors (Li et al., 2017a). Training deep networks, however, poses a great challenge: it requires large amounts of labeled training data, which costs considerable manpower and financial resources. Another drawback of deep networks is the assumption that the training data and the test data follow the same distribution, a condition rarely met in practice since the source domain data and the target domain data are usually collected in different environments (refer to Fig. 1).

To this end, numerous unsupervised domain adaptation (UDA) methods have been proposed to address the challenges above. The goal of UDA is to train a model on a labeled source dataset that generalizes to a related but unlabeled target dataset with a different distribution. For object detection, the problem is more intractable, as the model not only needs to identify but also to locate multiple objects in an image. Recently, many domain adaptive methods for object detection have been proposed based on the adversarial paradigm (Chen et al., 2018, Wang et al., 2019b, Zhang et al., 2019a, Xu et al., 2020). These approaches force the feature extractor to extract domain-invariant features that cannot be distinguished by a domain discriminator. For instance, the methods in Chen et al. (2018) and Xu et al. (2020) align the features between the source and target domains at both the global level and the instance level. Wang et al. (2019b) and Zhang et al. (2019a) use CycleGAN (Zhu et al., 2017) to align the distributions of the two domains at the pixel and feature levels. Although these methods can reduce the distance between the two domains to a certain extent, straightforward feature matching may introduce domain deviation problems, because not all source data are suitable for feature matching when transferring domain knowledge. Consequently, this kind of feature matching may only work well for two domains with a small gap. When the discrepancy between the domain distributions becomes large, naively matching all features and images between domains will cause the notorious phenomenon of 'negative transfer'.
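
To make the adversarial paradigm concrete, the following is a minimal PyTorch-style sketch of how a gradient reversal layer (Ganin et al., 2014) and an image-level domain discriminator can be attached to backbone features; the layer sizes and module names are illustrative assumptions rather than the exact architecture of any of the cited methods.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) the gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None


class DomainDiscriminator(nn.Module):
    """Predicts whether a backbone feature map comes from the source or the target domain."""

    def __init__(self, in_channels=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, 1))

    def forward(self, feat, lamb=1.0):
        feat = GradReverse.apply(feat, lamb)   # adversarial gradient for the feature extractor
        return self.net(feat)                  # one domain logit per image


# Usage sketch: the discriminator is trained to separate the domains, while the
# reversed gradient pushes the backbone toward domain-invariant features.
# logit = DomainDiscriminator()(backbone_features)
# loss = nn.BCEWithLogitsLoss()(logit, domain_label)
```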

Fortunately, some recent work on domain adaptive object detection has noticed the problem of full feature matching. At the global level, selecting suitable images for transfer is explored in Saito et al. (2019) and Xie et al. (2019). Xie et al. (2019) align domains at multiple levels. Saito et al. (2019) use the FocalLoss (FL) function (Lin et al., 2017) to perform weak alignment on global-level features and leverage an adversarial loss to perform strong alignment on local-level features (i.e., the regions on low-level features). However, the FL function only reduces the influence of the many easy negative samples during training, while neglecting the images with significant domain divergence; the trained detection model therefore performs poorly on such images. Besides, only a naive generative adversarial strategy is used to align the multi-level features in these methods, which cannot effectively reduce the cross-domain gap. Unlike them, the method in Zhu et al. (2019) only mines significant instances for transfer. However, these instances are aligned on ROI-Pooling based features, where each instance is an independent representation of a region feature without information interaction, and the localization accuracy of the instances cannot be guaranteed.
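
As a reference for the weak global alignment discussed above, below is a hedged sketch of a focal-loss-style domain objective in the spirit of Saito et al. (2019); the value of gamma and the exact weighting are assumptions for illustration.

```python
import torch


def weak_global_alignment_loss(domain_logit, domain_label, gamma=3.0):
    """Focal-loss-style domain classification loss for weak global alignment.

    domain_label is 1.0 for source images and 0.0 for target images. Images the
    discriminator classifies confidently (i.e., with an obvious domain gap) are
    down-weighted, so the adversarial gradient flowing back through a gradient
    reversal layer concentrates on images that are already globally similar,
    instead of forcibly aligning very dissimilar ones.
    """
    p = torch.sigmoid(domain_logit)                        # P(image is from the source domain)
    p_t = p * domain_label + (1.0 - p) * (1.0 - domain_label)
    focal_weight = (1.0 - p_t) ** gamma                    # small for well-classified images
    return -(focal_weight * torch.log(p_t + 1e-8)).mean()
```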

Based on the above discussion, we find that existing domain adaptive object detection methods typically aim to forcibly match the feature distributions of two domains at one or multiple levels. They do not consider a practical problem: not all local and global features have the same transferability. In addition, the contributions of instance and background features to domain transfer also differ across images. To address these problems, we introduce an attention mechanism (Vaswani et al., 2017, Wang et al., 2017a) into the Faster RCNN model (Ren et al., 2015) and explore its application in adaptive object detection. We apply the attention mechanism at both the local and global levels. For local-level features, we use the local-level attention generated by multiple domain discriminators to highlight the local features with large domain discrepancy (see the illustrative sketch after the contribution list), so that the knowledge of the two domains can be learned collaboratively and domain knowledge transfer is promoted better than in previous methods. For global-level features, we use the attention generated by the FL function (Lin et al., 2017) to focus on the objects (i.e., instances) instead of the whole image feature or the background feature. The global-level attention is designed on the following basis: the context vectors from the local-level and global-level features, which contain information about the whole image, are fused into the instance features to guide the subsequent detection. However, for images that are difficult to transfer, we should pay more attention to the objects (i.e., instances) in them and place less emphasis on the context that cannot be transferred, since the background of such difficult samples may cause negative transfer, which was not considered in previous studies (Saito et al., 2019, Wang et al., 2019a). Thus, the attention value is incorporated into the ROI-Pooling based features to focus more on the instances than on the context. In general, our contributions can be summarized as follows.

(1) A practical framework is proposed for domain adaptive object detection. The distribution divergence of the local-level features between different domains is reduced, and reliable images are selected for global-level feature alignment.

(2) An attention-based selection scheme is proposed. The local attention promotes collaborative learning between the two domains, and the attention mechanism focuses on the instances in difficult images.

(3) Extensive experiments on multiple widely-used datasets demonstrate the effectiveness of our method.
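
Complementing contribution (2) and the local-level attention described above, the sketch below illustrates one way the confidence of a pixel-wise local domain classifier could be turned into a spatial attention map that re-weights local features; using a single local discriminator and this specific re-weighting rule are simplifying assumptions, not the exact LGAAD design.

```python
import torch
import torch.nn as nn


class LocalAttentionReweight(nn.Module):
    """Fully convolutional domain classifier on low-level features whose per-location
    confidence is reused to highlight local regions with a large domain discrepancy."""

    def __init__(self, in_channels=256):
        super().__init__()
        self.local_disc = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, kernel_size=1))

    def forward(self, feat):
        d = torch.sigmoid(self.local_disc(feat))   # (N, 1, H, W) domain probability per location
        # confident predictions (d close to 0 or 1) indicate a large local domain gap;
        # 2 * |d - 0.5| turns that confidence into an attention value in [0, 1]
        attn = 2.0 * torch.abs(d - 0.5)
        reweighted = feat * (1.0 + attn)           # residual re-weighting keeps the original signal
        return reweighted, d                       # d also feeds the local adversarial loss
```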

We detail our LGAAD in Section 3. Before that, related work is reviewed in Section 2. Experimental results are reported in Section 4, and Section 5 concludes the paper.

Related work

Object Detection. With the rapid development of deep learning technology in recent years, deep learning has been applied in an increasing number of fields. For example, Altan et al. (2019) use a long short-term memory neural network and the empirical wavelet transform to forecast digital currencies, Altan and Karasu (2020) apply deep learning to recognize COVID-19 disease from X-ray images, and Karasu et al. (2020) forecast the price of crude oil with high precision. While the object …

Method

Our proposed method highlights the local features with larger domain discrepancy by introducing a local-level attention mechanism to significantly reduce the gap between the two domains, and utilizes the FL function (Lin et al., 2017) to focus on the transferable images in the two domains to alleviate the negative transfer caused by improper global-feature alignment. Meanwhile, we utilize context vectors to guide the subsequent detection to minimize the detection loss, and a global-level attention …
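
To illustrate the context-vector fusion mentioned in this overview, the snippet below shows one plausible way image-level context vectors taken from the local and global domain classifiers could be concatenated with ROI-pooled instance features, with an optional attention value emphasizing instances over context; all shapes and the fusion rule are assumptions, not the paper's exact formulation.

```python
import torch


def fuse_context(roi_feats, local_ctx, global_ctx, instance_attn=None):
    """Fuse image-level context into per-proposal instance features.

    roi_feats: (num_proposals, C) ROI-pooled instance features.
    local_ctx, global_ctx: 1-D context vectors from the domain classifiers.
    instance_attn: optional scalar or (num_proposals, 1) attention weights used
    to emphasize instances over background context for hard-to-transfer images.
    """
    if instance_attn is not None:
        roi_feats = roi_feats * instance_attn             # focus on the instances themselves
    ctx = torch.cat([local_ctx, global_ctx], dim=-1)      # image-level context vector
    ctx = ctx.unsqueeze(0).expand(roi_feats.size(0), -1)  # replicate for every proposal
    return torch.cat([roi_feats, ctx], dim=-1)            # fed to the detection head
```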

Experiments

In this section, we conduct two types of experiments to verify the effectiveness of our proposed method, i.e., domain adaptation with small and with large cross-domain discrepancies. The experimental results are compared with other outstanding methods, which verifies the superiority of our method. In the end, we illustrate the effectiveness of our proposed method through ablation analysis, visualization of attention maps, qualitative analysis, and feature visualization.

Conclusions

In this paper, we propose a domain adaptive framework for object detection based on the attention mechanism. On the one hand, the strong alignment of local features between the source and target domains reduces the domain gap without changing category-level semantic information. On the other hand, matching the transferable images between the source domain and the target domain using the FL function alleviates the negative transfer caused by improper image alignment. Meanwhile, we utilize …

CRediT authorship contribution statement

Dan Zhang: Methodology, Writing - original draft, Software. Jingjing Li: Conceptualization, Software. Xingpeng Li: Data curation, Validation. Zhekai Du: Supervision, Writing - review & editing. Lin Xiong: Visualization, Validation. Mao Ye: Investigation, Resources, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Key R&D Program of China (2018YFE0203900), National Natural Science Foundation of China (61773093), Important Science and Technology Innovation Projects in Chengdu, PR China (2018-YF08-00039-GX), Key R&D Programs of Sichuan Science and Technology Department, PR China (2020YFG0476).

References (60)

  • Chan, W., et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition.
  • Chen, Y., Li, W., Sakaridis, C., Dai, D., Van Gool, L., 2018. Domain adaptive faster r-cnn for object detection in the...
  • Chen, L.-C., Yang, Y., Wang, J., Xu, W., Yuille, A.L., 2016. Attention to scale: Scale-aware semantic image...
  • Chen, C., Zheng, Z., Ding, X., Huang, Y., Dou, Q., 2020. Harmonizing transferability and discriminability for adapting...
  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B., 2016....
  • Dalal, N., et al. Histograms of oriented gradients for human detection.
  • Deng, J., et al. Imagenet: A large-scale hierarchical image database.
  • Everingham, M., et al., 2010. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis.
  • Felzenszwalb, P.F., et al., 2009. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell.
  • Ganin, Y., et al., 2014. Unsupervised domain adaptation by backpropagation.
  • Girshick, R., 2015. Fast r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision. pp....
  • Goodfellow, I., et al. Generative adversarial nets.
  • He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE...
  • Inoue, N., Furuta, R., Yamasaki, T., Aizawa, K., 2018. Cross-domain weakly-supervised object detection through...
  • Jiang, J., Zhai, C., 2007. Instance weighting for domain adaptation in NLP. In: Proceedings of the 45th Annual Meeting...
  • Johnson-Roberson, M., et al., 2016. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks?
  • Kim, T., Jeong, M., Kim, S., Choi, S., Kim, C., 2019. Diversify and match: A domain adaptive representation learning...
  • Li, J., et al., 2020. Maximum density divergence for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell.
  • Li, J., et al., 2019. Locality preserving joint transfer for domain adaptation. IEEE Trans. Image Process.
  • Li, J., et al., 2018. Transfer independently together: A generalized framework for domain adaptation. IEEE Trans. Cybern.