An unsupervised domain adaptation scheme for single-stage artwork recognition in cultural sites

https://doi.org/10.1016/j.imavis.2021.104098

Highlights

  • A new dataset to study unsupervised domain adaptation methods for object detection.

  • An in-depth study of unsupervised domain adaptation methods for artwork detection.

  • A new unsupervised domain adaptation method which generalizes across different datasets.

Abstract

Recognizing artworks in a cultural site using images acquired from the user's point of view (First Person Vision) allows building interesting applications for both visitors and site managers. However, current object detection algorithms working in fully supervised settings need to be trained with large quantities of labeled data to achieve good performance, and collecting and annotating such data is costly and time-consuming. Using synthetic data generated from the 3D model of the cultural site to train the algorithms can reduce these costs. On the other hand, when these models are tested on real images, a significant drop in performance is observed due to the differences between real and synthetic images. In this study we consider the problem of Unsupervised Domain Adaptation for object detection in cultural sites. To address this problem, we created a new dataset containing both synthetic and real images of 16 different artworks. We then investigated different domain adaptation techniques based on one-stage and two-stage object detectors, image-to-image translation and feature alignment. Based on the observation that single-stage detectors are more robust to the domain shift in the considered settings, we propose a new method, called DA-RetinaNet, which builds on RetinaNet and feature alignment. The proposed approach achieves better results than the compared methods on the proposed dataset and on Cityscapes. To support research in this field, we release the dataset at https://iplab.dmi.unict.it/EGO-CH-OBJ-UDA/ and the code of the proposed architecture at https://github.com/fpv-iplab/DA-RetinaNet.

Introduction

Recognizing artworks in a cultural site is a key feature for many applications aimed either at providing additional services to users or at obtaining insights into the behavior of visitors, and hence measuring the performance of the cultural site [1]. For example, artwork recognition makes it possible to automatically show additional information about an artwork observed by the visitor through augmented reality [2], or to monitor visitor behavior to understand where people spend more time during their visit, as well as to infer which artworks attract their interest [3]. Artwork recognition can be achieved by fine-tuning standard object detector architectures (e.g., Faster-RCNN [4], YOLO [5], RetinaNet [6]) on labeled data. However, in order to achieve good performance, object detection algorithms need to be trained on large datasets of manually labeled images. Depending on the cultural site, collecting and labeling visual data can be difficult, especially when many artworks are present and their images should be acquired from different points of view. Moreover, labeling these data with bounding box annotations for each artwork is expensive and, since objects must be recognized at the instance level, the collection and labeling efforts must be repeated for each cultural site.
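To make the fully supervised baseline concrete, the following minimal sketch fine-tunes a RetinaNet detector with torchvision on labeled artwork images. It is an illustration rather than the pipeline used in this work: the dummy sample stands in for a labeled dataset, and the class count (16 artworks plus background, per torchvision's convention) is an assumption.

    import torch
    from torchvision.models.detection import retinanet_resnet50_fpn

    # 16 artwork classes plus background, following torchvision's convention.
    model = retinanet_resnet50_fpn(weights=None, num_classes=17)
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

    # One dummy training step; in practice images/targets come from the labeled set.
    images = [torch.rand(3, 512, 512)]
    targets = [{"boxes": torch.tensor([[50.0, 60.0, 200.0, 220.0]]),
                "labels": torch.tensor([1])}]

    loss_dict = model(images, targets)   # classification + box regression losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()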

To mitigate the aforementioned problems, a recent work [7] proposed an approach to generate large quantities of synthetic images from the 3D model of a cultural site by simulating a visitor navigating the site. Since the position of each artwork can be labeled in the 3D model (i.e., one 3D bounding box per artwork), all images acquired during the simulated navigation can be automatically labeled with 2D bounding box annotations. This approach makes it easy to generate labeled datasets of arbitrary size which can be used to train an object detection algorithm. Nonetheless, there is a domain gap between the generated and the real visual data which the object detector must deal with at test time. Fig. 1 shows the results of a standard Faster-RCNN model trained on the labeled synthetic images. Due to the domain gap, the model successfully detects artworks in synthetic images, whereas it fails on real images. In fact, these algorithms assume that the images used for training and those on which the algorithm will be tested belong to the same domain distribution. In this context, for example, if an object detector is trained on a dataset of synthetic images and tested on a dataset containing their real counterparts, the performance will drastically drop and, in many cases, the algorithm will not be able to recognize the artworks, as shown in Fig. 1. In this experimental scenario, the set of training data is generally referred to as the “source domain”, whereas the set of test data is called the “target domain”. The drop in performance due to the domain gap represents a significant limitation, since it requires the creation of a dataset of annotated images belonging to the target domain in order to re-train or fine-tune the algorithms. The labeling process, in particular, imposes additional costs in terms of money and time. For this reason, many works have focused on reducing the domain gap by leveraging labeled images belonging to a source domain and only unlabeled images from the target domain. This research area is referred to as “Unsupervised Domain Adaptation” [8], [9].
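To illustrate how such automatic labeling can work, the sketch below (a simplified illustration, not the tool of [7]) projects the eight corners of an artwork's 3D bounding box into a rendered frame under a pinhole camera model and takes the enclosing axis-aligned rectangle as the 2D annotation; the camera intrinsics K and extrinsics [R|t] are assumed known from the simulated navigation.

    import numpy as np

    def project_3d_box(corners_world, K, R, t):
        """corners_world: (8, 3) corners of the artwork's 3D bounding box;
        K: (3, 3) intrinsics; R, t: world-to-camera rotation and translation."""
        cam = R @ corners_world.T + t.reshape(3, 1)   # world -> camera coordinates
        if np.any(cam[2] <= 0):                       # box (partly) behind the camera
            return None
        px = K @ cam                                  # camera -> homogeneous pixels
        px = px[:2] / px[2]                           # perspective divide
        x1, y1 = px.min(axis=1)
        x2, y2 = px.max(axis=1)
        return x1, y1, x2, y2                         # axis-aligned 2D box label

In practice the resulting box would also be clipped to the image bounds and discarded when the artwork is occluded or outside the field of view.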

In this paper, we investigate the use of unsupervised domain adaptation techniques for artwork detection. Specifically, we consider a scenario in which large quantities of labeled synthetic images are available, whereas only unlabeled real images can be used at training time. The synthetic images can be easily obtained starting from a 3D model of the cultural site acquired with a 3D scanner such as Matterport and using the tool proposed in [7] to automatically generate the labeled data. The real unlabeled images can be easily collected by visiting the cultural site and acquiring videos with a wearable camera. Note that, since no manual labeling is required for the real images in the unsupervised settings, this procedure has a low cost. We hence aim to train object detection models using labeled synthetic images and real unlabeled images. To the best of our knowledge, there are no publicly available datasets to study domain adaptation for artwork detection in cultural sites. Therefore, we collect and publicly release a suitable one, which we name UDA-CH (Unsupervised Domain Adaptation on Cultural Heritage). We then study the main unsupervised domain adaptation techniques for object detection on UDA-CH: 1) image-to-image translation and 2) feature alignment. We compare the performance of two popular object detection approaches, Faster R-CNN [4] and RetinaNet [6]. Since in our study RetinaNet proved more robust to the domain gap than Faster-RCNN, we propose a novel approach which combines feature alignment techniques based on adversarial learning [9] for unsupervised domain adaptation with the RetinaNet architecture. Our experiments show that the proposed approach greatly outperforms prior art. When combined with image-to-image translation, our method achieves a mAP of 58.01% on real data without seeing a single labeled real image at training time. To further demonstrate the effectiveness of the proposed method, we also tested the generalization of the approach in an urban scenario using the popular Cityscapes dataset [10], [11].
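The key ingredient of such adversarial feature alignment is a gradient reversal layer (GRL) feeding a domain discriminator: the discriminator learns to distinguish synthetic from real features, while the reversed gradient pushes the feature extractor to make them indistinguishable. The following sketch shows the mechanism in PyTorch; the discriminator architecture and the feature maps it attaches to are illustrative assumptions, and the actual design of DA-RetinaNet is available in the released code.

    import torch
    import torch.nn as nn

    class GradientReversal(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lambda_):
            ctx.lambda_ = lambda_
            return x.view_as(x)               # identity on the forward pass

        @staticmethod
        def backward(ctx, grad_output):
            # Reversed, scaled gradient: the feature extractor is trained to
            # fool the domain discriminator.
            return -ctx.lambda_ * grad_output, None

    class DomainDiscriminator(nn.Module):
        """Binary classifier on detector features (architecture is illustrative)."""
        def __init__(self, in_channels=256, lambda_=1.0):
            super().__init__()
            self.lambda_ = lambda_
            self.net = nn.Sequential(
                nn.Conv2d(in_channels, 256, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(256, 1),            # logit: synthetic vs. real domain
            )

        def forward(self, features):
            return self.net(GradientReversal.apply(features, self.lambda_))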

In sum, the contributions of this paper are as follows: 1) we introduce a new dataset to study synthetic-to-real unsupervised domain adaptation for artwork detection in cultural sites. The dataset has been acquired from a first person point of view in a real cultural site containing 16 artworks; 2) we benchmark different solutions to address unsupervised domain adaptation for artwork detection; 3) we propose a novel architecture based on RetinaNet which obtains better results than similar approaches based on Faster-RCNN. The code of our approach is publicly available at https://github.com/fpv-iplab/DA-RetinaNet; 4) we demonstrate the generalization of the proposed approach considering also a popular dataset of a different domain (i.e., the urban domain); 5) we analyze the limits of the investigated techniques and discuss future research directions.

The remainder of the paper is organized as follows. In Section 2, we discuss related work. Section 3 presents the compared methods. Section 4 reports the experimental settings and discusses results. Section 5 concludes the paper and summarizes the main findings of our study.

Section snippets

Related works

Our work is related to different lines of research: egocentric vision in cultural sites, object detection, image-to-image translation, feature alignment for domain adaptation, and unsupervised domain adaptation for object detection. The following sections discuss the relevant works belonging to these research lines.

Methods

We compare several approaches to unsupervised domain adaptation for object detection. Specifically, we consider the following: 1) a baseline object detector without adaptation, 2) domain adaptation through image-to-image translation, 3) domain adaptation through feature alignment, 4) the proposed method based on RetinaNet and feature alignment, and 5) approaches combining feature alignment and image-to-image translation, as sketched below. In the following sections, we give details on all the compared approaches.
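As a conceptual illustration of how the combined approaches (category 5) are trained, the sketch below performs one training step in which labeled synthetic images, optionally translated to the real style offline (e.g., with CycleGAN), drive the detection loss, while unlabeled real images contribute only to a domain-adversarial loss. The detector interface returning features alongside losses is a hypothetical simplification, not the released implementation.

    import torch
    import torch.nn.functional as F

    def train_step(detector, discriminator, optimizer,
                   src_images, src_targets, tgt_images):
        # Hypothetical detector interface: returns backbone features and the
        # standard detection loss (focal + box regression) when targets are given.
        feat_s, det_loss = detector(src_images, src_targets)
        feat_t, _ = detector(tgt_images, None)        # real images: no labels used

        # Domain labels: 0 = synthetic (source), 1 = real (target).
        logits = torch.cat([discriminator(feat_s), discriminator(feat_t)])
        domains = torch.cat([torch.zeros(len(feat_s), 1),
                             torch.ones(len(feat_t), 1)])
        adv_loss = F.binary_cross_entropy_with_logits(logits, domains)

        # The GRL inside the discriminator turns minimizing adv_loss into
        # aligning source and target feature distributions.
        loss = det_loss + adv_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()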

Experimental settings and results

This section presents the proposed dataset, reports and analyzes the results of the methods presented in the previous section, and discusses the computational resources required to train all the models.

Conclusion

We considered the problem of Unsupervised Domain Adaptation for object detection in cultural sites. To conduct our study, we created a new dataset consisting of 75,244 synthetic images and 2190 real images of 16 artworks, which we publicly release. To better assess the generalization of the compared approaches, we also performed experiments with a dataset related to the urban environment. Experiments showed that the proposed DA-RetinaNet method achieves better performance compared to DA-Faster RCNN and

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research is supported by the project VALUE - Visual Analysis for Localization and Understanding of Environments (N. 08CT6209090207 - CUP G69J18001060007) - PO FESR 2014/2020 - Azione 1.1.5., by Piano di incentivi per la ricerca di Ateneo 2020/2022 (Pia.ce.ri.) Linea 2 - University of Catania, and by MIUR AIM - Attrazione e Mobilità Internazionale Linea 1 - AIM1893589 - CUP E64118002540007.

References (40)

  • F. Ragusa et al., EGO-CH: dataset and fundamental tasks for visitors behavioral understanding using egocentric vision, Pattern Recogn. Lett. (2020)
  • L. Seidenari et al., Deep artwork detection and retrieval for automatic context-aware audio guides
  • F. Ragusa et al., Egocentric point of interest recognition in cultural sites
  • S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks
  • J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, in: ...
  • T.-Y. Lin et al., Focal loss for dense object detection
  • S.A. Orlando et al., Egocentric visitor localization and artwork detection in cultural sites using synthetic data, Pattern Recogn. Lett. (2020)
  • E. Tzeng et al., Adversarial discriminative domain adaptation
  • Y. Ganin et al., Unsupervised domain adaptation by backpropagation, in: International Conference on Machine Learning (2015)
  • M. Cordts et al., The Cityscapes dataset for semantic urban scene understanding
  • C. Sakaridis et al., Semantic foggy scene understanding with synthetic data, Int. J. Comput. Vis. (2018)
  • R. Cucchiara et al., Visions for augmented cultural heritage experience, IEEE MultiMedia (2014)
  • M. Portaz et al., Fully convolutional network and region proposal for instance identification with egocentric vision
  • S. Varma et al., Object detection and classification in surveillance system
  • H. Pirsiavash et al., Detecting activities of daily living in first-person camera views
  • X. Chen et al., Multi-view 3D object detection network for autonomous driving, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1907–1915
  • Z. Cai et al., Cascade R-CNN: delving into high quality object detection
  • K. He et al., Mask R-CNN
  • W. Liu et al., SSD: single shot multibox detector
  • A. Rozantsev et al., Beyond sharing weights for deep domain adaptation, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
1 These authors are co-first authors and contributed equally to this work.
