Cognition, Volume 199, June 2020, 104092

Original Article
How optimal is word recognition under multimodal uncertainty?

https://doi.org/10.1016/j.cognition.2019.104092

Abstract

Identifying a spoken word in a referential context requires both the ability to integrate multimodal input and the ability to reason under uncertainty. How do these tasks interact with one another? We study how adults identify novel words under joint uncertainty in the auditory and visual modalities, and we propose an ideal observer model of how cues in these modalities are combined optimally. Model predictions are tested in four experiments where recognition is made under various sources of uncertainty. We found that participants use both auditory and visual cues to recognize novel words. When the signal is not distorted with environmental noise, participants weight the auditory and visual cues optimally, that is, according to the relative reliability of each modality. In contrast, when one modality has noise added to it, human perceivers systematically prefer the unperturbed modality to a greater extent than the optimal model does. This work extends the literature on perceptual cue combination to the case of word recognition in a referential context. In addition, this context offers a link to the study of multimodal information in word meaning learning.

Introduction

Language uses symbols expressed in one modality — the auditory modality, in the case of speech — to communicate about the world, which we perceive through many different sensory modalities. Consider hearing someone yell “bee!” at a picnic, as a honey bee buzzes around the food. Identifying a word involves processing the auditory information as well as other perceptual signals (e.g., the visual image of the bee, the sound of its wings, the sensation of the bee flying by your arm). A word is successfully identified when information from these modalities provides convergent evidence.

However, word identification takes place in a noisy world, and the cues received through each modality may not provide a definitive answer. On the auditory side, individual acoustic word tokens are almost always ambiguous with respect to the particular sequence of phonemes they represent, due to the inherent variability in how a phonetic category is realized acoustically (Hillenbrand et al., 1995). Moreover, some tokens may be additionally distorted by mispronunciation or ambient noise. Perhaps the speaker was yelling “pea” and not “bee.” Similarly, a sensory impression may not be enough to make a definitive identification of a visual category.1 Perhaps the insect was a beetle or a fly instead. How does the listener deal with such multimodal uncertainty to recognize the speaker's intended word?

As a simplified case study of early word learning, the task of matching sounds to corresponding visual objects has been studied extensively in the developmental literature. For example, many studies focus on how children might succeed in this type of task despite referential ambiguity ([Medina et al., 2011], [Pinker, 1989], [Smith and Yu, 2008], [Suanda et al., 2014], [Vlach and Johnson, 2013], [Vouloumanos, 2008], [Yurovsky and Frank, 2015]). However, even when they have learned the exact meaning of a word, observers (both children and adults) often still find it challenging to recognize which word the speaker has uttered, especially under noise ([Mattys et al., 2012], [Peelle, 2018]). The purpose of the current study is thus to explore word recognition by adults under multimodal uncertainty, focusing on the special case where people have access to multimodal cues from the auditory speech and the visual referent. In the General Discussion, we return to the question of how these findings relate to questions about word learning.

One rigorous way to approach this question is through an ideal observer analysis. This research strategy provides a characterization of the task/goal and shows what optimal performance should be under this characterization.2 When there is uncertainty in the input, the ideal observer performs an optimal probabilistic inference. For example, in order to recognize an ambiguous linguistic input, the model uses all available probabilistic knowledge in order to maximize the accuracy of this recognition. The ideal observer model can be seen as a theoretical upper limit on performance. It is not so much a realistic model of human performance as a baseline against which human performance can be compared ([Geisler, 2003], [Rahnev and Denison, 2018]). When there is a deviation from the ideal, it can reveal extra constraints on human cognition, such as limitations on working memory or attentional resources. This approach has had a tremendous impact not only on speech-related research ([Clayards et al., 2008], [Feldman et al., 2009], [Kleinschmidt and Jaeger, 2015], [Norris and McQueen, 2008]) but also on many other disciplines in the cognitive sciences (for reviews, see [Chater and Manning, 2006], [Knill and Pouget, 2004], [Tenenbaum et al., 2011]).
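To make this concrete, the following minimal Python sketch (the candidate words, likelihood functions, and parameter values are illustrative assumptions, not the materials or code of the present study) shows how an ideal observer would combine a noisy auditory token and a noisy visual referent under Bayes' rule, treating the two modalities as conditionally independent given the word:

```python
import numpy as np
from scipy.stats import norm

# Illustrative setup: two candidate novel words, each associated with a position
# on a one-dimensional auditory cue axis and a one-dimensional visual cue axis.
CANDIDATES = {
    "word_A": {"audio_mean": 0.0, "visual_mean": 0.0},
    "word_B": {"audio_mean": 1.0, "visual_mean": 1.0},
}

SIGMA_AUDIO = 0.4   # assumed overall auditory uncertainty
SIGMA_VISUAL = 0.6  # assumed overall visual uncertainty

def posterior(audio_obs, visual_obs, prior=None):
    """Posterior over candidate words given one auditory and one visual observation."""
    words = list(CANDIDATES)
    prior = prior or {w: 1.0 / len(words) for w in words}
    unnorm = {}
    for w in words:
        like_a = norm.pdf(audio_obs, CANDIDATES[w]["audio_mean"], SIGMA_AUDIO)
        like_v = norm.pdf(visual_obs, CANDIDATES[w]["visual_mean"], SIGMA_VISUAL)
        # Modalities are treated as conditionally independent given the word.
        unnorm[w] = prior[w] * like_a * like_v
    z = sum(unnorm.values())
    return {w: p / z for w, p in unnorm.items()}

# An ambiguous trial: the auditory token leans toward word_A, the visual referent toward word_B.
print(posterior(audio_obs=0.3, visual_obs=0.8))
```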

Some prior ideal observer studies are closely related to the question we are addressing in the current work. For instance, Clayards et al. (2008) simulated auditory uncertainty by manipulating the probability distribution of a cue (Voice Onset Time) that differentiated similar words (e.g., “beach” and “peach”). They found that humans were sensitive to these probabilistic cues and their judgments closely reflected the optimal predictions. Moreover, Feldman et al. (2009) studied the perceptual magnet effect, a phenomenon that involves reduced discriminability near prototypical sounds in the native language (Kuhl, 1991), showing that this effect can be explained as the consequence of optimally solving the problem of perception under uncertainty (see also Kronrod et al., 2016).

Besides the acoustic cues explored in Clayards et al. (2008) and Feldman et al. (2009), there is extensive evidence that information from the visual modality, such as the speaker's facial features, also influences speech understanding (see Campbell, 2008 for a review). Bejjanki et al. (2011) offered a mathematical characterization of how probabilistic cues from speech and lip movements can be optimally combined. They showed that human performance during audio-visual phonemic labeling was consistent (at least at the qualitative level) with the predictions of an ideal observer. This previous research did not, however, study speech understanding when visual information was obtained through the referential context rather than through observation of the speaker's face. Although some experimental findings show that information about the identity of a referent can be integrated with linguistic information to resolve lexical and syntactic ambiguities in speech (e.g., [Eberhard et al., 1995], [Spivey et al., 2002], [Tanenhaus et al., 1995]), to our knowledge no study has offered an ideal observer analysis of this task (as we do here).

Combining information between words and visual referents might seem similar to audio-visual speech integration (e.g., Bejjanki et al., 2011), but there are at least two fundamental differences between these two cases, and both can influence the way the auditory and visual cues are combined.

First, in the case of audio-visual speech, both modalities offer information about the same underlying speech category. They differ only in terms of their informational reliability. In a referential context, however, the auditory and visual modalities play different roles in the referential process — the auditory input represents the symbol whereas the visual input represents the meaning (and these differences are in addition to possible differences in informational reliability). Speech is claimed to have a privileged status compared to other sensory stimuli ([Edmiston and Lupyan, 2015], [Lupyan and Thompson-Schill, 2012], [Vouloumanos and Waxman, 2014], [Waxman and Gelman, 2009], [Waxman and Markow, 1995]), and this privilege is suggested to be specifically related to the ability to refer (Waxman and Gelman, 2009).3 Thus, in a referential context, it is possible that listeners do not treat the auditory and visual modalities as equivalent sources of information. Instead, there could be a (potentially sub-optimal) bias for the auditory modality beyond what is expected from informational reliability alone.

Second, in the case of audio-visual speech, the auditory and visual stimuli are expected to be perceptually correlated. The expectation for this correlation is strong enough that when there is a mismatch between the auditory and visual input, they are still integrated into a unified (but illusory) percept (e.g., the McGurk Effect; McGurk and MacDonald, 1976). In the case of referential language, however, the multimodal association is by nature arbitrary ([Greenberg, 1957], [Saussure, 1916]). For instance, there is no logical or perceptual connection between the sound “bee” and the corresponding insect. Moreover, variation in the way the sound “bee” is pronounced is generally not expected to correlate perceptually with variation in the shape (or any other visual property) in the category of bees. In sum, cue combination in the case of arbitrary audio-visual associations (word-referent) is likely to be less automatic, more effortful, and therefore less conducive to optimal integration than it is in the case of perceptually correlated associations (as in audio-visual speech perception).

We investigate how cues from the auditory and the visual modality are combined to recognize novel words in a referential context. In particular, we study how this combination is performed under various degrees of uncertainty in both the auditory and the visual modality. Imagine, for example, that someone is uncertain whether they heard “pea” or “bee”. Does this uncertainty make them rely more on the referent (e.g., the object being pointed at)? Or, if they are not sure if they saw a bee or a fly, does this uncertainty make them rely more on the sound? More importantly, when input in both modalities is uncertain to varying degrees, do they weight each modality according to its relative reliability (the optimal strategy), or do they over-rely on a particular modality?

We begin by proposing an ideal observer model that performs the combination in an optimal fashion. We then compare the predictions of the optimal model to human responses. Humans can deviate from the ideal for several reasons. For instance, as mentioned above, a sub-optimality can be induced by the privileged status of a particular modality or by the arbitrariness of the referential association. In order to study possible patterns of sub-optimality, we compare the optimal normative model to a descriptive model (which is fit to actual responses). Comparing parameter estimates between these two formulations allows us to quantify the degree of deviation from optimality.
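As a schematic illustration of this comparison (the choice rule, simulated responses, and all parameter values below are illustrative assumptions, not the models reported in this paper), one can fit a single auditory weight to binary responses and compare it to the reliability-based optimum:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical variances of the auditory and visual cues, and the optimal weight
# that an ideal observer would place on the auditory cue.
var_a, var_v = 0.16, 0.36
w_optimal = (1 / var_a) / (1 / var_a + 1 / var_v)

# Simulated two-alternative choices standing in for participants' responses,
# generated with an over-reliance on audition (true_w > w_optimal).
rng = np.random.default_rng(0)
n = 500
audio = rng.normal(0, 1, n)    # trial-by-trial auditory evidence (positive favors word_B)
visual = rng.normal(0, 1, n)   # trial-by-trial visual evidence
true_w = 0.8
p_b = 1 / (1 + np.exp(-(true_w * audio + (1 - true_w) * visual) * 4))
choices = rng.random(n) < p_b

def neg_log_lik(w):
    """Negative log-likelihood of the choices under a weighted logistic choice rule."""
    z = (w * audio + (1 - w) * visual) * 4
    p = np.clip(1 / (1 + np.exp(-z)), 1e-9, 1 - 1e-9)
    return -np.sum(choices * np.log(p) + (~choices) * np.log(1 - p))

# Descriptive model: the weight is a free parameter fit to the responses.
w_fit = minimize_scalar(neg_log_lik, bounds=(0, 1), method="bounded").x
print(f"optimal weight on audio: {w_optimal:.2f}, fitted weight: {w_fit:.2f}")
```

The gap between the fitted and the optimal weight is one simple way to quantify the degree (and direction) of deviation from optimality.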

We tested the ideal observer model's predictions in four behavioral experiments where we varied the source of uncertainty. In Experiment 1, audio-visual tokens were ambiguous with respect to their category membership (in addition to sensory noise). In Experiment 2, we intervened by adding environmental noise that degraded information from the auditory modality, and in Experiment 3 we intervened by adding environmental noise that degraded information from the visual modality. Finally, in Experiment 4, we replicated Experiment 1 with a higher-powered design, allowing us to test cue combination at the individual level.

Paradigm and models

In this section, we first briefly introduce the multimodal combination task. Then we explain how behavior in this paradigm can be characterized optimally with an ideal observer model.

Experiment 1

In this experiment, we test the predictions of the model in the case where uncertainty is due to categorical variability (i.e., ambiguity in terms of category membership) and inherent sensory noise. We do not add any external noise to the background. Thus, we test the following (normative) cue weighting scheme.8
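The weighting scheme can be written, using the same notation as in Experiments 2 and 3 and assuming no environmental noise term, as:

$$\beta_a \propto \frac{1}{\sigma_A^2} = \frac{1}{\sigma_{AC}^2 + \sigma_{AN}^2}, \qquad \beta_v \propto \frac{1}{\sigma_V^2} = \frac{1}{\sigma_{VC}^2 + \sigma_{VN}^2},$$

where the $C$ subscripts denote categorical variability and the $N$ subscripts denote inherent sensory noise in the auditory ($A$) and visual ($V$) modality.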

Experiment 2

In this experiment, we explored the effect of added environmental noise $\sigma_E^2$ on performance. We tested a case where the background noise was added to the auditory modality. We were interested to know if participants would treat this new source of uncertainty as predicted by the optimal model, that is, according to the following weighting scheme:

$$\beta_a \propto \frac{1}{\sigma_A^2} = \frac{1}{\sigma_{AC}^2 + \sigma_{AN}^2 + \sigma_{AE}^2}, \qquad \beta_v \propto \frac{1}{\sigma_V^2} = \frac{1}{\sigma_{VC}^2 + \sigma_{VN}^2}.$$
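A minimal numerical sketch of this scheme (the helper function and all variance values below are hypothetical, chosen only for illustration) shows how adding environmental noise to the auditory modality shifts the optimal weights toward vision:

```python
def optimal_weights(var_ac, var_an, var_vc, var_vn, var_ae=0.0, var_ve=0.0):
    """Reliability-based (optimal) weights on the auditory and visual cues.

    var_*c: categorical variability, var_*n: sensory noise,
    var_*e: added environmental noise (zero when absent).
    Illustrative helper, not the paper's code.
    """
    rel_a = 1.0 / (var_ac + var_an + var_ae)
    rel_v = 1.0 / (var_vc + var_vn + var_ve)
    total = rel_a + rel_v
    return rel_a / total, rel_v / total

# Experiment 2 case: environmental noise added to the auditory modality.
print(optimal_weights(0.10, 0.05, 0.10, 0.05))              # no added noise: (0.5, 0.5)
print(optimal_weights(0.10, 0.05, 0.10, 0.05, var_ae=0.30)) # auditory noise: weight shifts to vision
```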

The alternative hypothesis is that noise in one modality leads to a systematic preference for the non-noisy modality.

Experiment 3

Similar to Experiment 2, we were interested to know if participants would treat additional uncertainty as predicted by the optimal model, that is, according to the following weighting scheme:

$$\beta_a \propto \frac{1}{\sigma_A^2} = \frac{1}{\sigma_{AC}^2 + \sigma_{AN}^2}, \qquad \beta_v \propto \frac{1}{\sigma_V^2} = \frac{1}{\sigma_{VC}^2 + \sigma_{VN}^2 + \sigma_{VE}^2}.$$

The alternative hypothesis is that noise in the visual modality would lead to a preference for the auditory input, just as noise in the auditory modality led to a preference for the visual input in Experiment 2.
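Using the same hypothetical `optimal_weights` helper sketched under Experiment 2, the mirror prediction is that added visual noise shifts the optimal weights toward audition:

```python
# Experiment 3 case: environmental noise added to the visual modality instead.
# Under the optimal scheme, the weight now shifts toward the auditory cue.
print(optimal_weights(0.10, 0.05, 0.10, 0.05, var_ve=0.30))  # -> (0.75, 0.25)
```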

Experiment 4

As we noted earlier, we did not have enough statistical power in Experiment 1 to fit a different model for each participant. Thus, here we used a higher-powered design, allowing us to collect the number of data points necessary to model cue combination at the individual level.

General discussion

In the current paper, we explored word recognition under uncertainty about both words and their referents. We conducted an ideal observer analysis of this task whereby a model provided predictions about how information from each modality should be combined in an optimal fashion. The predictions of the model were tested in a series of four experiments in which instances of both the form and the meaning were ambiguous with respect to their category membership only (Experiments 1 and 4), when environmental noise degraded the auditory modality (Experiment 2), and when it degraded the visual modality (Experiment 3).

Conclusions and future research directions

Our work used an ideal observer model to study word recognition under audio-visual uncertainty. This framework enabled us not only to test optimality but also to examine systematically how and by how much people deviate from optimality in their combination strategies. Thus, our work is part of a growing effort to go beyond optimality tests — which have limited explanatory power — and use models that also allow us to identify and explain various patterns of sub-optimality in human behavior.

Data availability

All data and code for these analyses are available at https://github.com/afourtassi/WordRec

Conflicts of interest

None of the authors have any financial interest or a conflict of interest regarding this work and this submission.

Acknowledgements

This work was supported by a post-doctoral grant from the Fyssen Foundation.

References (80)

• A. Vouloumanos (2008). Fine-grained sensitivity to statistical information in adult word learning. Cognition.
• A. Vouloumanos et al. (2014). Listen up! Speech is for thinking during infancy. Trends in Cognitive Sciences.
• S. Waxman et al. (1995). Words as invitations to form categories: Evidence from 12- to 13-month-old infants. Cognitive Psychology.
• H. Yeung et al. (2009). Learning words’ sounds before learning how words sound: 9-month-olds use distinct objects as cues to categorize speech information. Cognition.
• J.R. Anderson (1990). The adaptive character of thought.
• K.R. Bankieris et al. (2017). Sensory cue-combination in the context of newly learned categories. Scientific Reports.
• W.R. Barnhart et al. (2018). Different patterns of modality dominance across development. Acta Psychologica.
• D. Bates et al. (1988). Nonlinear regression analysis and its applications.
• V. Bejjanki et al. (2011). Cue integration in categorical tasks: Insights from audio-visual speech perception. PLoS ONE.
• E. Bergelson et al. (2012). At 6 to 9 months, human infants know the meanings of many common nouns. Proceedings of the National Academy of Sciences.
• P. Bloom (2000). How children learn the meanings of words.
• R. Campbell (2008). The processing of audio-visual speech: Empirical and neural bases. Philosophical Transactions of the Royal Society of London B: Biological Sciences.
• M. Clayards et al. (2008). Perception of speech reflects optimal use of probabilistic speech cues. Cognition.
• J.A. Coady et al. (2003). Phonological neighbourhoods in the developing lexicon. Journal of Child Language.
• F.B. Colavita (1974). Human sensory dominance. Attention, Perception, & Psychophysics.
• S. Creel (2012). Phonological similarity and mutual exclusivity: On-line recognition of atypical pronunciations in 3–5-year-olds. Developmental Science.
• I. Dautriche et al. (2017). Wordform similarity increases with semantic similarity: An analysis of 100 languages. Cognitive Science.
• K. Eberhard et al. (1995). Eye movements as a window into real-time spoken language comprehension in natural contexts. Journal of Psycholinguistic Research.
• M.O. Ernst et al. (2002). Humans integrate visual and haptic information in a statistically optimal fashion. Nature.
• N. Feldman et al. (2009). The influence of categories on perception: Explaining the perceptual magnet effect as optimal statistical inference. Psychological Review.
• A. Fourtassi et al. (2014). Exploring the relative role of bottom-up and top-down information in phoneme learning. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
• D. Freedman et al. (2001). Categorical representation of visual stimuli in the primate prefrontal cortex. Science.
• W.S. Geisler (2003). Ideal observer analysis. The Visual Neurosciences.
• J. Greenberg (1957). Essays in linguistics.
• D. Harwath et al. (2016). Unsupervised learning of spoken language with visual context. Advances in Neural Information Processing Systems.
• J. Hillenbrand et al. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America.
• R.J. Hirst et al. (2018). Vision dominates audition in adults but not children: A meta-analysis of the Colavita effect. Neuroscience & Biobehavioral Reviews.
• M. Hofer et al. (2017). Modeling sources of uncertainty in spoken word learning. Proceedings of the 39th Annual Meeting of the Cognitive Science Society.
• D.F. Kleinschmidt et al. (2015). Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review.
• Y. Kronrod et al. (2016). A unified account of categorical effects in phonetic perception. Psychonomic Bulletin & Review.