Cortex

Volume 143, October 2021, Pages 127-147

Research Report
Cortical networks of dynamic scene category representation in the human brain

https://doi.org/10.1016/j.cortex.2021.07.008

Abstract

Humans have an impressive ability to rapidly process global information in natural scenes to infer their category. Yet, it remains unclear whether and how scene categories observed dynamically in the natural world are represented in cerebral cortex beyond a few canonical scene-selective areas. To address this question, here we examined the representation of dynamic visual scenes by recording whole-brain blood oxygenation level-dependent (BOLD) responses while subjects viewed natural movies. We fit voxelwise encoding models to estimate tuning for scene categories that reflect statistical ensembles of objects and actions in the natural world. We find that this scene-category model explains a significant portion of the response variance broadly across cerebral cortex. Cluster analysis of scene-category tuning profiles across cortex reveals nine spatially segregated networks of brain regions, consistently across subjects. These networks show heterogeneous tuning for a diverse set of dynamic scene categories related to navigation, human activity, social interaction, civilization, natural environment, non-human animals, motion energy, and texture, suggesting that scene-category representation is organized across multiple distributed cortical networks rather than confined to scene-selective areas.

Introduction

A primary aim of visual neuroscience is to shed light on how the human brain represents diverse information in natural scenes. Behavioral research on scene perception suggests that humans categorize scenes to more efficiently process the wealth of information they contain (Greene & Oliva, 2009; Konkle, Brady, Alvarez, & Oliva, 2010; Rousselet, Joubert, & Fabre-Thorpe, 2005). Therefore, it is likely that information about scene categories is represented across cortex. Consistent with this notion, previous neuroimaging studies have demonstrated that the category of a visual scene can be classified among a limited number of basic categories (e.g., beaches, forests, mountains) based on blood-oxygen-level-dependent (BOLD) responses in classical scene-selective regions (parahippocampal place area, PPA; retrosplenial complex, RSC; and occipital place area, OPA), object-selective lateral occipital complex (LO), and anterior visual cortex (Epstein & Morgan, 2012; Jung, Larsen, & Walther, 2018; Walther, Caddigan, Fei-Fei, & Beck, 2009; Walther, Chai, Caddigan, Beck, & Fei-Fei, 2011). A common approach in these studies was to operationally sort visual scenes into a few non-overlapping categories. However, natural scene categories can show varying degrees of statistical correlation, and a real-world scene might be characterized under several distinct categories. In addition, because these studies used static scenes, they could not address how dynamic scene categories are represented in the human brain.
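
To make the decoding approach in these prior studies concrete, the sketch below classifies a scene's basic category from multi-voxel response patterns in a region of interest. It is a minimal illustration with synthetic stand-in data and hypothetical variable names, not a reproduction of any cited study's pipeline.

```python
# Minimal sketch: decode a scene's basic category from ROI response patterns.
# All data here are synthetic stand-ins; names are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_voxels, n_classes = 120, 80, 4          # e.g., beach/forest/mountain/city
patterns = rng.standard_normal((n_trials, n_voxels))  # ROI response per trial
labels = rng.integers(0, n_classes, size=n_trials)    # scene-category label per trial

# Cross-validated multi-class linear classification; accuracy reliably above
# chance would imply the ROI carries scene-category information.
clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, patterns, labels, cv=5).mean()
print(f"decoding accuracy: {acc:.2f} (chance = {1 / n_classes:.2f})")
```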

To examine the statistics of natural scene categories, a recent study (Stansbury, Naselaris, & Gallant, 2013) used a data-driven algorithm to derive a broad set of scene categories while taking potential similarities between the categories into account. In this approach, each scene category is defined as a list of presence probabilities for a large array of constituent objects that appear within natural scenes. Once the algorithm learns a set of categories, the likelihood that a given scene belongs to each of the learned categories can be inferred from the objects within the scene. This scene-category model has been reported to yield improved predictions of single-voxel BOLD responses in classical face- and scene-selective areas compared to an alternative model based on the presence of a few diagnostic objects that frequently appeared in the presented natural images (Stansbury et al., 2013). This result raises the possibility that object co-occurrence statistics form the basis of scene-category definitions, above and beyond the individual objects present in scenes.
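
As a concrete illustration of this category-learning scheme, the sketch below learns latent scene categories from object-occurrence counts using Latent Dirichlet allocation (Blei et al., 2003), the kind of topic model used for this purpose. The toy data, component count, and variable names are assumptions for illustration only.

```python
# Minimal sketch of the category-learning idea: learn latent scene categories
# from object co-occurrence counts with Latent Dirichlet allocation.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Rows: scenes; columns: counts of labeled objects/actions in each scene.
# (Toy count matrix standing in for real scene annotations.)
rng = np.random.default_rng(0)
scene_label_counts = rng.poisson(lam=1.0, size=(200, 50))

# Learn 20 latent categories; each is a probability distribution over labels.
lda = LatentDirichletAllocation(n_components=20, random_state=0)
lda.fit(scene_label_counts)

# Per-category label probabilities (the "presence probabilities" above).
category_label_probs = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# For a new scene, infer the likelihood that it belongs to each category.
new_scene = rng.poisson(lam=1.0, size=(1, 50))
category_posteriors = lda.transform(new_scene)  # sums to 1 across categories
```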

Stansbury et al. defined categories of static scenes via their constituent objects and, like many prior studies on scene representation, focused on category responses in classical scene-selective regions (Epstein & Morgan, 2012; Jung et al., 2018; Walther et al., 2009, 2011). Yet, several recent studies imply that much of anterior visual cortex might be organized by differential tuning of voxels for actions within visual scenes (Tarhan & Konkle, 2020; Çukur, Huth, Nishimoto, & Gallant, 2016). In fact, real-world scenes contain dynamic interactions between objects and actions that give rise to more elaborate categories (Greene, Baldassano, Esteva, Beck, & Fei-Fei, 2016), and such scenes have been reported to elicit widespread responses across visual cortex (Deen, Koldewyn, Kanwisher, & Saxe, 2015; Epstein & Baker, 2019; Isik, Koldewyn, Beeler, & Kanwisher, 2017; Maguire et al., 1998). Therefore, it is likely that natural scene categories based on the co-occurrence of objects and actions are represented across broadly distributed networks in the human brain.

Here, we sought to learn high-level features that capture scene-category information in dynamic visual scenes, and to examine how this information is represented across cerebral cortex. We first recorded BOLD responses while subjects viewed a large set of natural movies that contained 5252 distinct objects and actions. To identify scene-category features, we employed a statistical learning algorithm that learned a large set of categories on the basis of the co-occurrence statistics of objects and actions in the natural world. We then used the learned scene categories within a voxelwise modeling framework (Huth, Nishimoto, Vu, & Gallant, 2012; Nishimoto et al., 2011; Çukur et al., 2016; Çukur, Nishimoto, Huth, & Gallant, 2013) to estimate scene-category tuning profiles in single voxels across cerebral cortex. Subsequently, we performed a clustering analysis to reveal large-scale networks of brain regions that differ in their scene-category tuning.
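
The sketch below illustrates the general shape of such a voxelwise-modeling-plus-clustering analysis: regularized linear regression maps delayed category features onto each voxel's BOLD response, and normalized voxel tuning profiles are then clustered. The delay choice, regularization grid, data shapes, and variable names are illustrative assumptions, not the study's exact procedure.

```python
# Minimal sketch of voxelwise encoding followed by tuning-profile clustering.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_trs, n_categories, n_voxels = 300, 20, 100
X = rng.standard_normal((n_trs, n_categories))  # category features per TR
Y = rng.standard_normal((n_trs, n_voxels))      # BOLD response per voxel

# Account for hemodynamic lag with a few temporal delays of the features
# (np.roll wraps at the array edges; acceptable for a toy sketch).
delays = [2, 3, 4]  # in TRs; a common heuristic choice
X_delayed = np.hstack([np.roll(X, d, axis=0) for d in delays])

# Ridge regression fit jointly over all voxels, with cross-validated alpha.
model = RidgeCV(alphas=np.logspace(0, 4, 10))
model.fit(X_delayed, Y)

# Average over delays to get one tuning weight per category per voxel.
W = model.coef_.reshape(n_voxels, len(delays), n_categories).mean(axis=1)

# Cluster normalized tuning profiles into putative networks (nine here,
# mirroring the nine clusters reported in this study).
W_norm = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
clusters = KMeans(n_clusters=9, n_init=10, random_state=0).fit_predict(W_norm)
```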

Section snippets

Subjects

Five healthy human subjects (all male, ages 25–32 years) with normal or corrected-to-normal vision participated in this study. MRI data were collected in five separate scan sessions: three sessions for the main experiment, one session for acquiring functional localizers, and one session for acquiring anatomical data. Experimental protocols were approved by the Committee for the Protection of Human Subjects at the University of California, Berkeley. All subjects gave written informed consent

Results

To investigate the nature of high-level scene information that is represented across the cerebral cortex, we recorded BOLD responses while subjects passively viewed 2 h of natural movies. We used voxelwise modeling to assess scene representations in single voxels. We fit a scene-category model to measure tuning for scene categories (e.g., an urban street, a forest) that reflect co-occurrence statistics of objects and actions in natural scenes. Model performance was evaluated by calculating
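
The snippet above is cut off at this point. In voxelwise-modeling work of this kind, performance is commonly quantified as the correlation between predicted and measured responses on held-out data; the sketch below computes such a per-voxel prediction score. This is a generic illustration, not necessarily the exact metric used in the full text.

```python
# Minimal sketch of a per-voxel prediction score: Pearson correlation between
# measured and model-predicted responses on held-out data.
import numpy as np

def prediction_score(Y_true, Y_pred):
    """Pearson r between measured and predicted responses, per voxel (column)."""
    Yt = Y_true - Y_true.mean(axis=0)
    Yp = Y_pred - Y_pred.mean(axis=0)
    num = (Yt * Yp).sum(axis=0)
    den = np.sqrt((Yt ** 2).sum(axis=0) * (Yp ** 2).sum(axis=0)) + 1e-12
    return num / den  # one r value per voxel
```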

Discussion

The aim of this study was to investigate representation of dynamic visual scenes across the human brain. To do this, we fit a scene-category model to measure voxelwise tuning for hundreds of scene categories, where categories were learned inductively as statistical ensembles of objects and actions in natural scenes. We find that this scene-category model explains a significant portion of the response variance broadly across cerebral cortex. We then performed cluster analysis on voxelwise tuning

CRediT author statement

Emin Çelik: Conceptualization, Methodology, Software, Validation, Formal analysis, Writing – Review & Editing, Visualization.

Umit Keles: Conceptualization, Methodology, Software, Formal analysis, Writing – Original Draft, Visualization.

İbrahim Kiremitçi: Validation, Formal analysis.

Jack L. Gallant: Writing – Review & Editing.

Tolga Çukur: Conceptualization, Investigation, Resources, Writing – Review & Editing, Supervision, Project Administration, Funding Acquisition.

Open practices

The study in this article earned Open Data and Open Materials badges for transparent practices. Data for this study can be found at https://crcns.org/data-sets/vc/vim-2 and https://crcns.org/data-sets/vc/vim-4.

Acknowledgments

The authors declare no competing financial interests. The work was supported in part by a National Eye Institute Grant (EY019684), by a Marie Curie Actions Career Integration Grant (PCIG13-GA-2013-618101), by a European Molecular Biology Organization Installation Grant (IG 3028), by a TUBA GEBIP 2015 fellowship, and by a Science Academy BAGEP 2017 award. We thank D. Stansbury, A. Huth, and S. Nishimoto for assistance in various aspects of this research. We report how we determined our sample

References (100)

  • A.G. Huth et al. A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron (2012)
  • M. Jenkinson et al. Improved optimization for the robust and accurate linear registration and motion correction of brain images. NeuroImage (2002)
  • M.X. Lowe et al. Neural representation of geometry and surface properties in object and scene perception. NeuroImage (2017)
  • G.L. Malcolm et al. Making sense of real-world scenes. Trends in Cognitive Sciences (2016)
  • S. Nishimoto et al. Reconstructing visual experiences from brain activity evoked by natural movies. Current Biology (2011)
  • M.I. Posner et al. Analyzing and shaping human attentional networks. Neural Networks (2006)
  • R. Saxe. Uniquely human social cognition. Current Opinion in Neurobiology (2006)
  • A. Schindler et al. Visual high-level regions respond to high-level stimulus content in the absence of low-level confounds. NeuroImage (2016)
  • J.T. Serences et al. Computational advances towards linking BOLD and behavior. Neuropsychologia (2012)
  • D.E. Stansbury et al. Natural scene statistics account for the representation of scene categories in human visual cortex. Neuron (2013)
  • D.M. Watson et al. Patterns of response to visual scenes are linked to the low-level properties of the image. NeuroImage (2014)
  • P. Agrawal et al. Pixels to voxels: Modeling visual representation in the human brain (2014)
  • G.K. Aguirre et al. Environmental knowledge is subserved by separable dorsal/ventral neural areas. Journal of Neuroscience (1997)
  • T.J. Andrews et al. Low-level properties of natural images predict topographic patterns of neural response in the ventral visual pathway. Journal of Vision (2015)
  • D. Arthur et al. k-means++: The advantages of careful seeding
  • H.C. Barrett. A hierarchical model of the evolution of human brain specializations. Proceedings of the National Academy of Sciences of the United States of America (2012)
  • Y. Benjamini et al. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics (2001)
  • M. Bilalić. Revisiting the role of the fusiform face area in expertise. Journal of Cognitive Neuroscience (2016)
  • S. Bird et al. Natural language processing with Python: Analyzing text with the natural language toolkit (2009)
  • D.M. Blei et al. Latent Dirichlet allocation. Journal of Machine Learning Research (2003)
  • C.F. Cadieu et al. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology (2014)
  • G. Calvert et al. Reading speech from still and moving faces: The neural substrates of visible speech. Journal of Cognitive Neuroscience (2003)
  • A.E. Cavanna et al. The precuneus: A review of its functional anatomy and behavioural correlates. Brain (2006)
  • D.L. Chen et al. Collecting highly parallel data for paraphrase evaluation
  • R.M. Cichy et al. Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports (2016)
  • T. Çukur et al. Functional subdomains within human FFA. Journal of Neuroscience (2013)
  • T. Çukur et al. Functional subdomains within scene-selective cortex: Parahippocampal place area, retrosplenial complex, and occipital place area. Journal of Neuroscience (2016)
  • T. Çukur et al. Attention during natural vision warps semantic representation across the human brain. Nature Neuroscience (2013)
  • S.V. David et al. Predicting neuronal responses during natural vision. Network: Computation in Neural Systems (2005)
  • B. Deen et al. Functional organization of social perception and cognition in the superior temporal sulcus. Cerebral Cortex (2015)
  • D.D. Dilks et al. The occipital place area is causally and selectively involved in scene perception. Journal of Neuroscience (2013)
  • P. Downing et al. A cortical area selective for visual processing of the human body. Science (2001)
  • S.A. Engel et al. Retinotopic organization in human visual cortex and the spatial precision of functional MRI. Cerebral Cortex (1997)
  • R.A. Epstein. Neural systems for visual scene recognition
  • R.A. Epstein et al. Scene perception in the human brain. Annual Review of Vision Science (2019)
  • R. Epstein et al. A cortical representation of the local visual environment. Nature (1998)
  • E.F. Ester et al. Categorical biases in human occipitoparietal cortex. The Journal of Neuroscience (2020)
  • J.S. Gao et al. Pycortex: An interactive surface visualizer for fMRI. Frontiers in Neuroinformatics (2015)
  • I. Gauthier et al. Expertise for cars and birds recruits brain areas involved in face recognition. Nature Neuroscience (2000)
  • I. Gauthier et al. The fusiform "face area" is part of a network that processes faces at the individual level. Journal of Cognitive Neuroscience (2000)