An unsupervised domain adaptation scheme for single-stage artwork recognition in cultural sites
Introduction
Recognizing artworks in a cultural site is a key feature for many applications that either provide additional services to visitors or gather insights into visitor behavior, and hence measure the performance of the cultural site [1]. For example, artwork recognition makes it possible to automatically show additional information about an artwork observed by the visitor through augmented reality [2], or to monitor visitor behavior to understand where people spend more time during their visit and to infer which artworks attract their interest [3]. Artwork recognition can be obtained by fine-tuning standard object detector architectures (e.g., Faster R-CNN [4], YOLO [5], RetinaNet [6]) on labeled data. However, to achieve good performance, object detection algorithms need to be trained on large datasets of manually labeled images. Depending on the cultural site, collecting and labeling visual data can be difficult, especially when many artworks are present whose images should be acquired from different points of view. Moreover, labeling these data with bounding box annotations for each artwork is expensive and, since objects must be recognized at the instance level, the collection and labeling effort must be repeated for each cultural site.
To mitigate the aforementioned problems, a recent work [7] proposed an approach to generate large quantities of synthetic images from the 3D model of a cultural site by simulating a visitor navigating the site. Since the position of each artwork can be labeled in the 3D model (i.e., one 3D bounding box per artwork), all images captured during the simulated navigation can be automatically labeled with 2D bounding box annotations. This approach makes it easy to generate labeled datasets of arbitrary size which can be used to train an object detection algorithm. Nonetheless, there is a domain gap between the generated and real visual data which the object detection models must deal with at test time. Fig. 1 shows the results of a standard Faster R-CNN model trained on the labeled synthetic images. Due to the domain gap, the model successfully detects artworks in synthetic images, whereas it fails on real images. In fact, these algorithms assume that the images used for training and those on which the algorithm will be tested belong to the same domain distribution. In this context, for example, if an object detector is trained on a dataset of synthetic images and tested on a dataset containing their real counterparts, performance drops drastically and, in many cases, the algorithm is not able to recognize the artworks, as shown in Fig. 1. In this experimental scenario, the set of training data is generally referred to as the "source domain", whereas the set of test data is called the "target domain". The drop in performance due to the domain gap represents a significant limitation, since overcoming it requires the creation of a dataset of annotated images belonging to the target domain in order to re-train or fine-tune the algorithms. The labeling process, in particular, imposes additional costs in terms of money and time.
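The automatic 2D labeling described above essentially amounts to projecting the eight corners of each artwork's 3D bounding box into the virtual camera and taking the enclosing rectangle. A minimal sketch of this projection step (the function name and the pinhole-camera conventions are our own assumptions for illustration, not the actual implementation of [7]):

```python
import numpy as np

def project_3d_box_to_2d(corners_world, K, R, t, img_w, img_h):
    """Project the 8 corners of a 3D bounding box into the image and
    return the enclosing 2D box (x_min, y_min, x_max, y_max), or None
    if the box is entirely behind the camera."""
    # World -> camera coordinates.
    pts_cam = R @ corners_world.T + t.reshape(3, 1)  # shape (3, 8)
    # Keep only corners in front of the camera.
    in_front = pts_cam[2] > 1e-6
    if not in_front.any():
        return None
    pts_cam = pts_cam[:, in_front]
    # Perspective projection with intrinsics K.
    pix = K @ (pts_cam / pts_cam[2])
    xs, ys = pix[0], pix[1]
    # Clip to the image and take the enclosing rectangle.
    x_min = float(np.clip(xs.min(), 0, img_w - 1))
    y_min = float(np.clip(ys.min(), 0, img_h - 1))
    x_max = float(np.clip(xs.max(), 0, img_w - 1))
    y_max = float(np.clip(ys.max(), 0, img_h - 1))
    return x_min, y_min, x_max, y_max
```

Running this once per rendered frame and per artwork yields the 2D annotations at no manual cost.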
For this reason, many works have focused on reducing the domain gap by leveraging labeled images belonging to a source domain and only unlabeled images from the target domain. This research area is referred to as “Unsupervised Domain Adaptation” [8], [9].
In this paper, we investigate the use of unsupervised domain adaptation techniques for artwork detection. Specifically, we consider a scenario in which large quantities of labeled synthetic images are available, whereas only unlabeled real images can be used at training time. The synthetic images can be easily obtained starting from a 3D model of the cultural site acquired with a 3D scanner such as Matterport and using the tool proposed in [7] to automatically generate the labeled data. The real unlabeled images can be easily collected by visiting the cultural site and acquiring videos with a wearable camera. Note that, since no manual labeling is required for the real images in the unsupervised setting, this procedure has a low cost. We hence aim to train object detection models using labeled synthetic images and unlabeled real images. To the best of our knowledge, there are no publicly available datasets to study domain adaptation for artwork detection in cultural sites. Therefore, we collect and publicly release a suitable one, which we name UDA-CH (Unsupervised Domain Adaptation on Cultural Heritage). We then study the main unsupervised domain adaptation techniques for object detection on UDA-CH: 1) image-to-image translation and 2) feature alignment. We compare the performance of two popular object detection approaches, Faster R-CNN [4] and RetinaNet [6]. Since in our study RetinaNet obtained results more robust to the domain gap than Faster R-CNN, we propose a novel approach which combines feature alignment techniques based on adversarial learning [9] for unsupervised domain adaptation with the RetinaNet architecture. Our experiments show that the proposed approach greatly outperforms prior art. When combined with image-to-image translation, our method achieves a mAP of 58.01% on real data without seeing a single labeled real image at training time.
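Feature alignment via adversarial learning, as in [9], is typically implemented with a gradient reversal layer (GRL) placed between the feature extractor and a domain classifier. A minimal, framework-free sketch of the GRL behavior (the class and parameter names are our own, for illustration only):

```python
import numpy as np

class GradientReversal:
    """Sketch of the gradient reversal layer (GRL) used in adversarial
    feature alignment. The forward pass is the identity; the backward
    pass flips the sign of the gradient and scales it by lambda, so the
    feature extractor is pushed to *maximize* the domain classifier's
    loss, producing domain-invariant features."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Identity in the forward direction: features pass unchanged.
        return features

    def backward(self, grad_from_domain_classifier):
        # Reverse and scale the gradient flowing back to the backbone.
        return -self.lam * grad_from_domain_classifier
```

In practice this is realized as a custom autograd operation in a deep learning framework, with lambda often scheduled from 0 to 1 during training.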
To further demonstrate the effectiveness of the proposed method, we also tested the generalization of the approach in an urban scenario exploiting the popular Cityscapes dataset [10], [11].
In sum, the contributions of this paper are as follows: 1) we introduce a new dataset to study synthetic-to-real unsupervised domain adaptation for artwork detection in cultural sites. The dataset has been acquired from a first-person point of view in a real cultural site with 16 artworks; 2) we benchmark different solutions to address unsupervised domain adaptation for artwork detection; 3) we propose a novel architecture based on RetinaNet which obtains better results than similar approaches based on Faster R-CNN. The code of our approach is publicly available at the following link: https://github.com/fpv-iplab/DA-RetinaNet; 4) we demonstrate the generalization of the proposed approach considering also a popular dataset from a different domain (i.e., the urban domain); 5) we analyze the limits of the investigated techniques and discuss future research directions.
The remainder of the paper is organized as follows. In Section 2, we discuss related work. Section 3 presents the compared methods. Section 4 reports the experimental settings and discusses results. Section 5 concludes the paper and summarizes the main findings of our study.
Related works
Our work is related to different lines of research: egocentric vision in cultural sites, object detection, image to image translation, feature alignment for domain adaptation, and unsupervised domain adaptation for object detection. The following sections discuss the relevant works belonging to these research lines.
Methods
We compare several approaches to unsupervised domain adaptation for object detection. Specifically, we consider the following: 1) a baseline object detector without adaptation, 2) domain adaptation through image-to-image translation, 3) domain adaptation through feature alignment, 4) the proposed method based on RetinaNet and feature alignment, and 5) approaches combining feature alignment and image-to-image translation. In the following sections, we give details on all the compared approaches.
Experimental settings and results
This section presents the proposed dataset, reports and analyzes the results of the methods presented in the previous section, and discusses the computational resources required to train all the models.
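For reference, detection metrics such as the reported mAP rely on matching predicted boxes to ground-truth boxes via Intersection over Union (IoU), typically at a 0.5 threshold in PASCAL VOC-style evaluation. A minimal IoU helper (a generic sketch, not the evaluation code actually used in the study):

```python
def iou(box_a, box_b):
    """Intersection over Union between two axis-aligned boxes given as
    (x_min, y_min, x_max, y_max). Returns a value in [0, 1]."""
    # Coordinates of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Intersection area is zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection is counted as a true positive when its IoU with an unmatched ground-truth box of the same class exceeds the threshold; average precision is then computed per class and averaged into mAP.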
Conclusion
We considered the problem of unsupervised domain adaptation for object detection in cultural sites. To conduct our study, we created a new dataset consisting of 75,244 synthetic images and 2,190 real images of 16 artworks, which we publicly release. To better assess the generalization of the compared approaches, we have also performed experiments with a dataset related to the urban environment. Experiments showed that the proposed DA-RetinaNet method achieves better performance compared to DA-Faster RCNN and
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research is supported by the project VALUE - Visual Analysis for Localization and Understanding of Environments (N. 08CT6209090207 - CUP G69J18001060007) - PO FESR 2014/2020 - Azione 1.1.5., by Piano di incentivi per la ricerca di Ateneo 2020/2022 (Pia.ce.ri.) Linea 2 - University of Catania, and by MIUR AIM - Attrazione e Mobilità Internazionale Linea 1 - AIM1893589 - CUP E64118002540007.
References (40)
- Ego-CH: Dataset and fundamental tasks for visitors behavioral understanding using egocentric vision, Pattern Recogn. Lett. (2020).
- Deep artwork detection and retrieval for automatic context-aware audio guides.
- Egocentric point of interest recognition in cultural sites.
- Faster R-CNN: Towards real-time object detection with region proposal networks.
- J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection.
- Focal loss for dense object detection.
- Egocentric visitor localization and artwork detection in cultural sites using synthetic data, Pattern Recogn. Lett. (2020).
- Adversarial discriminative domain adaptation.
- Unsupervised domain adaptation by backpropagation, in: International Conference on Machine Learning (2015).
- The Cityscapes dataset for semantic urban scene understanding.
- Semantic foggy scene understanding with synthetic data, Int. J. Comput. Vis.
- Visions for augmented cultural heritage experience, IEEE MultiMedia.
- Fully convolutional network and region proposal for instance identification with egocentric vision.
- Object detection and classification in surveillance system.
- Detecting activities of daily living in first-person camera views.
- Multi-view 3D object detection network for autonomous driving, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017).
- Cascade R-CNN: Delving into high quality object detection.
- Mask R-CNN.
- SSD: Single shot multibox detector.
- Beyond sharing weights for deep domain adaptation, IEEE Trans. Pattern Anal. Mach. Intell.
1. These authors are co-first authors and contributed equally to this work.