Are emotional objects visually salient? The Emotional Maps Database
Introduction
Most images depicting real-life scenes are composed of more informative regions (e.g., foreground objects, human figures, facial displays of emotion), which attract visual attention, and less informative ones (e.g., blank walls, background objects, homogeneous surfaces), which are mostly ignored [1]. In many cases, a detail within a scene creates meaning and evokes emotions in the viewer: a wound or a smile can transform an otherwise neutral scene into an intensely emotional one. In this research, we analyze the distribution of meaningful content within images commonly used in studies of emotion. In particular, we explore how visually salient and how clearly delimited the key elements of positive, negative, and neutral scenes are. We also provide a set of “emotional meaning maps” for commonly used emotional image databases.
Emotion-evoking stimuli rapidly attract attention and are processed in a prioritized way [2], [3], [4], [5], [6], [7], [8], [9]. To study this prioritization (as well as other aspects of emotional processing), emotional images depicting real-life scenes have been used in a multitude of studies over the past 40 years [10], [11], [12]. To achieve better experimental control over emotion induction, researchers in the social sciences use databases that provide emotionally charged photographs with standardized ratings of their emotional arousal and valence, obtained through large-sample evaluation [10]. Creators of these databases often adopt the dimensional concept of emotions [13], which assumes that emotion can be described along two basic dimensions: arousal and valence. Valence determines whether a stimulus is pleasant or unpleasant, while arousal determines whether it is calming or exciting. Both dimensions of a stimulus, be it an image, sound, or word, can be conveniently assessed with the Self-Assessment Manikin, a graphical 9-point rating scale devised by Bradley and Lang [10] that has become a standard tool for evaluating emotional stimuli. This approach is especially advantageous when regression models are used, as in this study [14].
Databases of natural emotional images include EmoPics [15], the Geneva Affective Picture Database (GAPED) [16], the International Affective Picture System (IAPS) [17], and the Nencki Affective Picture System (NAPS) [18]. All of them contain natural, real-life images depicting a broad cross-section of scene categories, including people, social interactions, animals, artificial objects, landscapes, interiors, erotica, and food.
The distribution of meaningful content within these scenes has rarely been analyzed systematically, even though detailed information regarding scene composition is often of primary interest. For example, in eye-tracking studies, information on the location of key objects is routinely used to select regions of interest, a preparatory step necessary for more refined analyses of fixation patterns [19], [20]. Meaning-driven regions of interest have also been employed in eye-tracking studies involving the presentation of emotional images [5], [21], [22], [23], [24], [25], [26], [27], [28]. In computer vision, the location of the most emotionally important region can serve as ground truth for deep neural network (DNN) algorithms [29], [30].
The most informative or meaningful regions of an image can be selected either manually or algorithmically. The algorithmic approach has been implemented in a variety of visual saliency models (for reviews, see [31], [32], [33]), including machine learning approaches, e.g., [34], [35], [36], [37]. Some saliency models rely on low-level local features such as edges or luminance and color contrast, e.g., [38], [39]; others analyze higher-order features such as objecthood, e.g., [40], [41]; more specialized ones incorporate pre-trained information about typical scene composition and its elements, e.g., [34], [35], [38], [42], [43], or combine several models into multi-stage learning pipelines, e.g., [37]. The main advantages of these approaches are their ease of use, efficiency, and repeatability.
Yet, in the case of emotional images, visual saliency seems to be a relatively poor predictor of attention engagement, and thus, presumably, a poor predictor of the distribution of the most meaningful regions within a scene. Experiments by Humphrey, Underwood, and Lambert [24] and by Niu and colleagues [5] showed that emotional visual objects attract attention irrespective of their visual saliency. Moreover, Pilarczyk and Kuniecki [25] showed that visual saliency alone, when decoupled from meaning, does not attract attention better than chance, particularly in the case of emotional images. The primacy of meaning over visual saliency in attention guidance has also been confirmed in studies not employing emotional images [44], [45], [46], [47], [48]. Still, meaning and visual saliency are somewhat intertwined: Elazary and Itti [49] showed that interesting objects within neutral scenes also tend to be visually salient, as measured with Graph-Based Visual Saliency (GBVS) [50], a purely bottom-up algorithm based on simple visual features.
When the algorithmic approach is not sufficient, researchers resort to manual segmentation of a scene. However, among databases of emotional images featuring demarcation of the most meaningful region, to our knowledge, researchers are limited to EMOd [29] and EmotionROI [30], both originating from the computer vision community. EMOd comprises 1019 images (321 from the IAPS [17] and 698 from the Internet) and features outlines of the most dominant objects along with their categorizations, providing semantic segmentation. However, since the object markings and descriptions were made by only three participants, EMOd does not provide high-resolution meaning maps and does not allow a pixel-by-pixel representation of emotional load analogous to saliency maps.
In contrast, the EmotionROI image database by Peng and colleagues [51] comes with an accompanying Emotion Stimuli Map (ESM), created by averaging the regions that Mechanical Turk participants selected as best capturing the emotional meaning of an image [30]. EmotionROI consists of 1980 images collected from Flickr by the authors using search keywords matching the six universal emotions identified by Ekman and Friesen [52] (anger, disgust, fear, joy, sadness, surprise). Apart from the ESM, EmotionROI also provides valence and arousal evaluations for each image, made with a tool modeled on the Self-Assessment Manikin scale [10].
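The aggregation behind an ESM-style map can be sketched in a few lines: per-participant region selections, stored as binary masks, are averaged into a continuous map. This is a minimal illustration under our own assumptions (the function name `meaning_map` and the toy 4×4 masks are hypothetical, not part of any database):

```python
import numpy as np

def meaning_map(selections):
    """Average binary region selections (one mask per participant)
    into a continuous meaning map. Each input is a 2-D 0/1 array of
    identical shape; the output lies in [0, 1], where higher values
    mean more participants marked that pixel as meaningful."""
    stack = np.stack([np.asarray(m, dtype=float) for m in selections])
    return stack.mean(axis=0)

# Three hypothetical 4x4 participant selections of a key region.
a = np.zeros((4, 4)); a[1:3, 1:3] = 1
b = np.zeros((4, 4)); b[1:3, 1:3] = 1
c = np.zeros((4, 4)); c[2, 2] = 1
m = meaning_map([a, b, c])  # m[2, 2] == 1.0; m[1, 1] == 2/3
```

A pixel selected by all participants gets the value 1, yielding a graded map that can be compared with saliency maps on equal footing.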
The aim of our project was twofold. First, we aimed to create a database of meaning maps showing the distribution of emotional content in photographs that have been rigorously standardized in terms of emotional valence and arousal. The Emotional Maps Database (EMD) is conceived as a complementary tool for the established and widely used sets of emotional images: EmoPics [15], the Geneva Affective Picture Database (GAPED) [16], the International Affective Picture System (IAPS) [17], and the Nencki Affective Picture System (NAPS) [18]. Second, we wanted to explore the similarity between meaning and visual saliency maps in emotional images. To this end, we compared our EMD maps (as well as ESM maps, see Appendix A) with maps generated by three saliency models: GBVS [50], Proto-objects [41], and SalGAN [53]. We also explored the participants’ agreement in selecting the most meaningful region of emotional scenes.
Our main contributions are as follows:
1) We present a new database of 950 meaning maps for images commonly used in emotion research. Meaning maps represent the spatial distribution of emotionally charged regions selected by a large group of participants. Thus, the maps have properties similar to saliency maps and can be directly compared with them.
2) We provide a comprehensive analysis of how the emotional characteristics of images, i.e., the valence (pleasant-unpleasant) and arousal (calm-arousing) dimensions, influence the similarity of meaning and saliency distributions. We analyze saliency models employing different definitions of saliency and different detection mechanisms, and we use various similarity measures to compare saliency and meaning maps.
3) We find that, despite being clearly delineated and easily detected by human participants, high-arousing and negative objects are less effectively detected by saliency models.
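The map comparisons described above require quantitative similarity measures. As an illustration only, here is a sketch of three measures commonly used in the saliency-evaluation literature (Pearson's correlation coefficient, histogram intersection, and KL divergence); the helper names are ours, and the specific measures used in this study may differ:

```python
import numpy as np

def normalize(m):
    """Shift a map to be non-negative and scale it to sum to 1,
    so it can be treated as a probability distribution."""
    m = np.asarray(m, dtype=float)
    m = m - m.min()
    s = m.sum()
    return m / s if s > 0 else np.full(m.shape, 1.0 / m.size)

def cc(a, b):
    """Pearson correlation between two maps (1 = identical shape of values)."""
    return float(np.corrcoef(np.ravel(a), np.ravel(b))[0, 1])

def sim(a, b):
    """Histogram intersection: sum of element-wise minima of the two
    normalized maps. Ranges from 0 (disjoint) to 1 (identical)."""
    return float(np.minimum(normalize(a), normalize(b)).sum())

def kl(a, b, eps=1e-12):
    """KL divergence of distribution b from a; 0 means identical."""
    p, q = normalize(a) + eps, normalize(b) + eps
    return float(np.sum(p * np.log(p / q)))
```

Because the measures weight agreement differently (CC is sensitive to linear covariation, SIM to overlapping mass, KL to missed high-density regions), reporting several of them gives a more robust picture than any single score.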
Participants
We recruited 296 participants (244 women, aged 18–52, M = 22, SD = 4). The participants were required to have normal or corrected-to-normal eyesight and intact color vision. They were recruited through the Jagiellonian University advertisement mailing system, and the majority were local university students. For participation in the study, they received course credit or payment. All participants gave informed consent prior to the
Results
The relationship between valence, arousal, and the similarity between emotional meaning maps and saliency maps was highly significant for all saliency models and all similarity measures, as evidenced by the F values and associated p-values of all regression models (Table 2). Valence was positively related to saliency in all models except Proto-object (Fig. 3), signifying that the most meaningful regions of negative images tend to be relatively less visually salient than those of positive ones. Arousal
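The regression reported above amounts to fitting per-image similarity scores on valence and arousal ratings. A minimal ordinary-least-squares sketch follows; the data are simulated purely for illustration (the coefficients and sample size are invented, not taken from the study):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # hypothetical number of images

# Simulated SAM ratings on the 1-9 scale and a per-image
# meaning-vs-saliency similarity score.
valence = rng.uniform(1, 9, n)
arousal = rng.uniform(1, 9, n)
# Inject the qualitative pattern described in the text:
# similarity rises with valence and falls with arousal.
similarity = 0.5 + 0.03 * valence - 0.02 * arousal + rng.normal(0, 0.05, n)

# OLS fit: similarity ~ intercept + valence + arousal
X = np.column_stack([np.ones(n), valence, arousal])
beta, *_ = np.linalg.lstsq(X, similarity, rcond=None)
intercept, b_valence, b_arousal = beta
```

A positive `b_valence` with a negative `b_arousal` reproduces the reported direction of effects: meaningful regions of negative and high-arousing images are relatively less salient.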
Discussion
In this study, we compared the emotional meaning maps (created from the participants' manual selections of key objects) with maps generated automatically by three saliency models, using three different similarity measures. We investigated how the emotional valence and arousal of an image influence the similarity between the human-made and algorithm-generated maps, and how they affect the agreement among participants in selecting an image's key regions.
Comparing the emotional meaning maps with
CRediT authorship contribution statement
Joanna Pilarczyk: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Project administration, Validation, Visualization, Writing – original draft, Writing – review & editing. Weronika Janeczko: Investigation, Data curation, Writing – original draft, Writing – review & editing. Radosław Sterna: Data curation, Writing – original draft, Writing – review & editing. Michał Kuniecki: Conceptualization, Formal analysis, Funding acquisition, Methodology, Project
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Science Center in Poland (grant numbers 2012/07/E/HS6/01046 and 2017/25/B/HS6/00758). During work on the paper, Radosław Sterna was supported by the funding from the budget for science in the years 2019-2023, as a research project (project number DI2018 015848) under the Diamond Grant program financed by the Ministry of Education and Science of Poland. We would like to thank Piotr Wójcik for technical support in the development of the computer tool for
References (70)
- et al., Brain mechanisms for emotional influences on perception and attention: what is magic and what is not, Biol. Psychol. (2013)
- et al., Measuring emotion: the self-assessment manikin and the semantic differential, J. Behav. Ther. Exp. Psychiatry (1994)
- et al., Emotion and the motivational brain, Biol. Psychol. (2010)
- et al., Best practices in eye tracking research, Int. J. Psychophysiol. (2020)
- et al., Neural correlates of attentional deployment within unpleasant pictures, NeuroImage (2013)
- et al., Phase of the menstrual cycle affects engagement of attention with emotional images, Psychoneuroendocrinology (2019)
- et al., Blue blood, red blood. How does the color of an emotional scene affect visual attention and pupil size?, Vision Res. (2020)
- et al., Visual saliency guided complex image retrieval, Pattern Recogn. Lett. (2020)
- et al., A saliency-based search mechanism for overt and covert shifts of visual attention, Vision Res. (2000)
- et al., A model of proto-object based saliency, Vision Res. (2014)