An unsupervised domain adaptation scheme for single-stage artwork recognition in cultural sites

https://doi.org/10.1016/j.imavis.2021.104098

Highlights

  • A new dataset to study unsupervised domain adaptation methods for object detection.

  • An in-depth study of unsupervised domain adaptation methods for artwork detection.

  • A new unsupervised domain adaptation method which generalizes across different datasets.

Abstract

Recognizing artworks in a cultural site using images acquired from the user's point of view (First Person Vision) allows building interesting applications for both visitors and site managers. However, current object detection algorithms working in fully supervised settings need to be trained with large quantities of labeled data to achieve good performance, and collecting and annotating such data is costly and time-consuming. Using synthetic data generated from the 3D model of the cultural site to train the algorithms can reduce these costs. On the other hand, when these models are tested on real images, a significant drop in performance is observed due to the differences between real and synthetic images. In this study we consider the problem of Unsupervised Domain Adaptation for object detection in cultural sites. To address this problem, we created a new dataset containing both synthetic and real images of 16 different artworks. We then investigated different domain adaptation techniques based on one-stage and two-stage object detectors, image-to-image translation and feature alignment. Based on the observation that single-stage detectors are more robust to the domain shift in the considered settings, we propose a new method, called DA-RetinaNet, which builds on RetinaNet and feature alignment. The proposed approach achieves better results than the compared methods on the proposed dataset and on Cityscapes. To support research in this field, we release the dataset at https://iplab.dmi.unict.it/EGO-CH-OBJ-UDA/ and the code of the proposed architecture at https://github.com/fpv-iplab/DA-RetinaNet.

Introduction

Recognizing artworks in a cultural site is a key feature for many applications aimed either at providing additional services to users or at obtaining insights into the behavior of visitors, and hence measuring the performance of the cultural site [1]. For example, artwork recognition makes it possible to automatically show additional information about an artwork observed by the visitor through augmented reality [2], or to monitor visitor behavior to understand where people spend more time during their visit, as well as to infer which artworks attract their interest [3]. Artwork recognition can be achieved by fine-tuning standard object detector architectures (e.g., Faster-RCNN [4], YOLO [5], RetinaNet [6]) on labeled data. However, in order to achieve good performance, object detection algorithms need to be trained on large datasets of manually labeled images. Depending on the cultural site, collecting and labeling visual data can be difficult, especially when many artworks are present and their images should be acquired from different points of view. Moreover, labeling these data with bounding box annotations for each artwork is expensive and, since objects must be recognized at the instance level, the collection and labeling efforts must be repeated for each cultural site.
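To make the fully supervised baseline concrete, the following minimal sketch fine-tunes a RetinaNet detector with torchvision on labeled artwork images. It is an illustration rather than the pipeline used in this work: the dummy sample stands in for a labeled dataset, and the class count (16 artworks plus background, per torchvision's convention) is an assumption.

    import torch
    from torchvision.models.detection import retinanet_resnet50_fpn

    # 16 artwork classes plus background, following torchvision's convention.
    model = retinanet_resnet50_fpn(weights=None, num_classes=17)
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

    # One dummy training step; in practice images/targets come from the labeled set.
    images = [torch.rand(3, 512, 512)]
    targets = [{"boxes": torch.tensor([[50.0, 60.0, 200.0, 220.0]]),
                "labels": torch.tensor([1])}]

    loss_dict = model(images, targets)   # classification + box regression losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()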

To mitigate the aforementioned problems, a recent work [7] proposed an approach to generate large quantities of synthetic images from the 3D model of a cultural site by simulating a visitor navigating the site. Since the position of each artwork can be labeled in the 3D model (i.e., one 3D bounding box per artwork), all images acquired during the simulated navigation can be automatically labeled with 2D bounding box annotations. This approach makes it easy to generate labeled datasets of arbitrary size which can be used to train an object detection algorithm. Nonetheless, there is a domain gap between the generated and the real visual data which the object detector must deal with at test time. Fig. 1 shows the results of a standard Faster-RCNN model trained on the labeled synthetic images. Due to the domain gap, the model successfully detects artworks in synthetic images, whereas it fails on real images. In fact, these algorithms assume that the images used for training and those on which the algorithm will be tested belong to the same domain distribution. In this context, for example, if an object detector is trained on a dataset of synthetic images and tested on a dataset containing their real counterparts, the performance will drastically drop and, in many cases, the algorithm will not be able to recognize the artworks, as shown in Fig. 1. In this experimental scenario, the set of training data is generally referred to as the “source domain”, whereas the set of test data is called the “target domain”. The drop in performance due to the domain gap represents a significant limitation, since it requires the creation of a dataset of annotated images belonging to the target domain in order to re-train or fine-tune the algorithms. The labeling process, in particular, imposes additional costs in terms of money and time. For this reason, many works have focused on reducing the domain gap by leveraging labeled images belonging to a source domain and only unlabeled images from the target domain. This research area is referred to as “Unsupervised Domain Adaptation” [8], [9].
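To illustrate how such automatic labeling can work, the sketch below (a simplified illustration, not the tool of [7]) projects the eight corners of an artwork's 3D bounding box into a rendered frame under a pinhole camera model and takes the enclosing axis-aligned rectangle as the 2D annotation; the camera intrinsics K and extrinsics [R|t] are assumed known from the simulated navigation.

    import numpy as np

    def project_3d_box(corners_world, K, R, t):
        """corners_world: (8, 3) corners of the artwork's 3D bounding box;
        K: (3, 3) intrinsics; R, t: world-to-camera rotation and translation."""
        cam = R @ corners_world.T + t.reshape(3, 1)   # world -> camera coordinates
        if np.any(cam[2] <= 0):                       # box (partly) behind the camera
            return None
        px = K @ cam                                  # camera -> homogeneous pixels
        px = px[:2] / px[2]                           # perspective divide
        x1, y1 = px.min(axis=1)
        x2, y2 = px.max(axis=1)
        return x1, y1, x2, y2                         # axis-aligned 2D box label

In practice the resulting box would also be clipped to the image bounds and discarded when the artwork is occluded or outside the field of view.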

In this paper, we investigate the use of unsupervised domain adaptation techniques for artwork detection. Specifically, we consider a scenario in which large quantities of labeled synthetic images are available, whereas only unlabeled real images can be used at training time. The synthetic images can be easily obtained starting from a 3D model of the cultural site acquired with a 3D scanner such as Matterport and using the tool proposed in [7] to automatically generate the labeled data. The real unlabeled images can be easily collected by visiting the cultural site and acquiring videos with a wearable camera. Note that, since no manual labeling is required for the real images in the unsupervised settings, this procedure has a low cost. We hence aim to train object detection models using labeled synthetic images and real unlabeled images. To the best of our knowledge, there are no publicly available datasets to study domain adaptation for artwork detection in cultural sites. Therefore, we collect and publicly release a suitable one, which we name UDA-CH (Unsupervised Domain Adaptation on Cultural Heritage). We then study the main unsupervised domain adaptation techniques for object detection on UDA-CH: 1) image-to-image translation and 2) feature alignment. We compare the performance of two popular object detection approaches, Faster R-CNN [4] and RetinaNet [6]. Since in our study RetinaNet proved more robust to the domain gap than Faster-RCNN, we propose a novel approach which combines feature alignment techniques based on adversarial learning [9] for unsupervised domain adaptation with the RetinaNet architecture. Our experiments show that the proposed approach greatly outperforms prior art. When combined with image-to-image translation, our method achieves a mAP of 58.01% on real data without seeing a single labeled real image at training time. To further demonstrate the effectiveness of the proposed method, we also tested the generalization of the approach in an urban scenario using the popular Cityscapes dataset [10], [11].
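The key ingredient of such adversarial feature alignment is a gradient reversal layer (GRL) feeding a domain discriminator: the discriminator learns to distinguish synthetic from real features, while the reversed gradient pushes the feature extractor to make them indistinguishable. The following sketch shows the mechanism in PyTorch; the discriminator architecture and the feature maps it attaches to are illustrative assumptions, and the actual design of DA-RetinaNet is available in the released code.

    import torch
    import torch.nn as nn

    class GradientReversal(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lambda_):
            ctx.lambda_ = lambda_
            return x.view_as(x)               # identity on the forward pass

        @staticmethod
        def backward(ctx, grad_output):
            # Reversed, scaled gradient: the feature extractor is trained to
            # fool the domain discriminator.
            return -ctx.lambda_ * grad_output, None

    class DomainDiscriminator(nn.Module):
        """Binary classifier on detector features (architecture is illustrative)."""
        def __init__(self, in_channels=256, lambda_=1.0):
            super().__init__()
            self.lambda_ = lambda_
            self.net = nn.Sequential(
                nn.Conv2d(in_channels, 256, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(256, 1),            # logit: synthetic vs. real domain
            )

        def forward(self, features):
            return self.net(GradientReversal.apply(features, self.lambda_))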

In sum, the contributions of this paper are as follows: 1) we introduce a new dataset to study synthetic-to-real unsupervised domain adaptation for artwork detection in cultural sites. The dataset has been acquired from a first person point of view in a real cultural site containing 16 artworks; 2) we benchmark different solutions to address unsupervised domain adaptation for artwork detection; 3) we propose a novel architecture based on RetinaNet which obtains better results than similar approaches based on Faster-RCNN. The code of our approach is publicly available at https://github.com/fpv-iplab/DA-RetinaNet; 4) we demonstrate the generalization of the proposed approach considering also a popular dataset of a different domain (i.e., the urban domain); 5) we analyze the limits of the investigated techniques and discuss future research directions.

The remainder of the paper is organized as follows. In Section 2, we discuss related work. Section 3 presents the compared methods. Section 4 reports the experimental settings and discusses results. Section 5 concludes the paper and summarizes the main findings of our study.

Section snippets

Related works

Our work is related to different lines of research: egocentric vision in cultural sites, object detection, image-to-image translation, feature alignment for domain adaptation, and unsupervised domain adaptation for object detection. The following sections discuss the relevant works belonging to these research lines.

Methods

We compare several approaches to unsupervised domain adaptation for object detection. Specifically, we consider the following: 1) a baseline object detector without adaptation, 2) domain adaptation through image-to-image translation, 3) domain adaptation through feature alignment, 4) the proposed method based on RetinaNet and feature alignment, and 5) approaches combining feature alignment and image-to-image translation, as sketched below. In the following sections, we give details on all the compared approaches.
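As a conceptual illustration of how the combined approaches (category 5) are trained, the sketch below performs one training step in which labeled synthetic images, optionally translated to the real style offline (e.g., with CycleGAN), drive the detection loss, while unlabeled real images contribute only to a domain-adversarial loss. The detector interface returning features alongside losses is a hypothetical simplification, not the released implementation.

    import torch
    import torch.nn.functional as F

    def train_step(detector, discriminator, optimizer,
                   src_images, src_targets, tgt_images):
        # Hypothetical detector interface: returns backbone features and the
        # standard detection loss (focal + box regression) when targets are given.
        feat_s, det_loss = detector(src_images, src_targets)
        feat_t, _ = detector(tgt_images, None)        # real images: no labels used

        # Domain labels: 0 = synthetic (source), 1 = real (target).
        logits = torch.cat([discriminator(feat_s), discriminator(feat_t)])
        domains = torch.cat([torch.zeros(len(feat_s), 1),
                             torch.ones(len(feat_t), 1)])
        adv_loss = F.binary_cross_entropy_with_logits(logits, domains)

        # The GRL inside the discriminator turns minimizing adv_loss into
        # aligning source and target feature distributions.
        loss = det_loss + adv_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()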

Experimental settings and results

This section presents the proposed dataset, reports and analyzes the results of the methods presented in the previous section, and discusses the computational resources required to train all the models.

Conclusion

We considered the problem of Unsupervised Domain Adaptation for object detection in cultural sites. To conduct our study, we created a new dataset consisting of 75,244 synthetic images and 2190 real images of 16 artworks, which we publicly release. To better assess the generalization of the compared approaches, we also performed experiments with a dataset related to the urban environment. Experiments showed that the proposed DA-RetinaNet method achieves better performance compared to DA-Faster RCNN and

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research is supported by the project VALUE - Visual Analysis for Localization and Understanding of Environments (N. 08CT6209090207 - CUP G69J18001060007) - PO FESR 2014/2020 - Azione 1.1.5., by Piano di incentivi per la ricerca di Ateneo 2020/2022 (Pia.ce.ri.) Linea 2 - University of Catania, and by MIUR AIM - Attrazione e Mobilità Internazionale Linea 1 - AIM1893589 - CUP E64118002540007.

References (40)

  • F. Ragusa et al., EGO-CH: dataset and fundamental tasks for visitors behavioral understanding using egocentric vision, Pattern Recogn. Lett. (2020)
  • L. Seidenari et al., Deep artwork detection and retrieval for automatic context-aware audio guides
  • F. Ragusa et al., Egocentric point of interest recognition in cultural sites
  • S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks
  • J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, in: ...
  • T.-Y. Lin et al., Focal loss for dense object detection
  • S.A. Orlando et al., Egocentric visitor localization and artwork detection in cultural sites using synthetic data, Pattern Recogn. Lett. (2020)
  • E. Tzeng et al., Adversarial discriminative domain adaptation
  • Y. Ganin et al., Unsupervised domain adaptation by backpropagation, in: International Conference on Machine Learning (2015)
  • M. Cordts et al., The Cityscapes dataset for semantic urban scene understanding
  • C. Sakaridis et al., Semantic foggy scene understanding with synthetic data, Int. J. Comput. Vis. (2018)
  • R. Cucchiara et al., Visions for augmented cultural heritage experience, IEEE MultiMedia (2014)
  • M. Portaz et al., Fully convolutional network and region proposal for instance identification with egocentric vision
  • S. Varma et al., Object detection and classification in surveillance system
  • H. Pirsiavash et al., Detecting activities of daily living in first-person camera views
  • X. Chen et al., Multi-view 3D object detection network for autonomous driving, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1907–1915
  • Z. Cai et al., Cascade R-CNN: delving into high quality object detection
  • K. He et al., Mask R-CNN
  • W. Liu et al., SSD: single shot multibox detector
  • A. Rozantsev et al., Beyond sharing weights for deep domain adaptation, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
1 These authors are co-first authors and contributed equally to this work.
