How optimal is word recognition under multimodal uncertainty?
Introduction
Language uses symbols expressed in one modality — the auditory modality, in the case of speech — to communicate about the world, which we perceive through many different sensory modalities. Consider hearing someone yell “bee!” at a picnic, as a honey bee buzzes around the food. Identifying a word involves processing the auditory information as well as other perceptual signals (e.g., the visual image of the bee, the sound of its wings, the sensation of the bee flying by your arm). A word is successfully identified when information from these modalities provides convergent evidence.
However, word identification takes place in a noisy world, and the cues received through each modality may not provide a definitive answer. On the auditory side, individual acoustic word tokens are almost always ambiguous with respect to the particular sequence of phonemes they represent, owing to the inherent variability in how a phonetic category is realized acoustically (Hillenbrand et al., 1995). Moreover, some tokens may be further distorted by mispronunciation or ambient noise. Perhaps the speaker was yelling “pea” and not “bee.” Similarly, a sensory impression may not be enough to make a definitive identification of a visual category. Perhaps the insect was a beetle or a fly instead. How does the listener deal with such multimodal uncertainty to recognize the speaker's intended word?
As a simplified case study of early word learning, the task of matching sounds to corresponding visual objects has been studied extensively in the developmental literature. For example, many studies focus on how children might succeed in this type of task despite referential ambiguity (Medina et al., 2011; Pinker, 1989; Smith and Yu, 2008; Suanda et al., 2014; Vlach and Johnson, 2013; Vouloumanos, 2008; Yurovsky and Frank, 2015). However, even when they have learned the exact meaning of a word, observers (both children and adults) often still find it challenging to recognize which word the speaker has uttered, especially under noise (Mattys et al., 2012; Peelle, 2018). The purpose of the current study is thus to explore word recognition by adults under multimodal uncertainty, focusing on the special case where people have access to multimodal cues from the auditory speech and the visual referent. In the General Discussion, we return to the question of how these findings relate to questions about word learning.
One rigorous way to approach this question is to conduct an ideal observer analysis. This research strategy provides a characterization of the task/goal and shows what the optimal performance should be under this characterization. When there is uncertainty in the input, the ideal observer performs an optimal probabilistic inference. For example, in order to recognize an ambiguous linguistic input, the model uses all available probabilistic knowledge to maximize the accuracy of recognition. The ideal observer model can be seen as a theoretical upper limit on performance: it is not so much a realistic model of human performance as a baseline against which human performance can be compared (Geisler, 2003; Rahnev and Denison, 2018). When there is a deviation from the ideal, it can reveal extra constraints on human cognition, such as limitations on working memory or attentional resources. This approach has had a tremendous impact not only on speech-related research (Clayards et al., 2008; Feldman et al., 2009; Kleinschmidt and Jaeger, 2015; Norris and McQueen, 2008) but also on many other disciplines in the cognitive sciences (for reviews, see Chater and Manning, 2006; Knill and Pouget, 2004; Tenenbaum et al., 2011).
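To make this kind of inference concrete, the sketch below implements a minimal ideal observer for the picnic example, assuming Gaussian cue likelihoods and auditory and visual cues that are conditionally independent given the word; the category parameters and numbers are illustrative assumptions, not the authors' implementation.

```python
from scipy.stats import norm

# Minimal ideal-observer sketch (illustrative; all numbers are assumptions).
# Two candidate words, each linked to a referent category. The observer gets
# a noisy auditory cue x_a (a point on an acoustic continuum) and a noisy
# visual cue x_v (a point on a visual continuum), and computes the posterior
# over words by multiplying the per-modality likelihoods with the prior.
WORDS = ["bee", "pea"]
MU_A = {"bee": -1.0, "pea": 1.0}   # auditory category means
MU_V = {"bee": -1.0, "pea": 1.0}   # visual category means
SD_A = {"bee": 1.0, "pea": 1.0}    # auditory spread (category + sensory noise)
SD_V = {"bee": 1.0, "pea": 1.0}    # visual spread

def posterior(x_a, x_v, prior=None):
    """P(word | x_a, x_v), assuming cues are independent given the word."""
    prior = prior or {w: 1.0 / len(WORDS) for w in WORDS}
    scores = {w: prior[w]
                 * norm.pdf(x_a, MU_A[w], SD_A[w])   # auditory likelihood
                 * norm.pdf(x_v, MU_V[w], SD_V[w])   # visual likelihood
              for w in WORDS}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}

# An ambiguous sound (x_a = 0.0) paired with a clearly bee-like referent
# (x_v = -1.0) is resolved in favor of "bee".
print(posterior(x_a=0.0, x_v=-1.0))
```

Note how the visual cue settles a sound that is, on its own, perfectly ambiguous between the two words; this is the sense in which an ideal observer lets each modality compensate for uncertainty in the other.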
Some prior ideal observer studies are closely related to the question we are addressing in the current work. For instance, Clayards et al. (2008) simulated auditory uncertainty by manipulating the probability distribution of a cue (Voice Onset Time) that differentiated similar words (e.g., “beach” and “peach”). They found that humans were sensitive to these probabilistic cues and their judgments closely reflected the optimal predictions. Moreover, Feldman et al. (2009) studied the perceptual magnet effect, a phenomenon that involves reduced discriminability near prototypical sounds in the native language (Kuhl, 1991), showing that this effect can be explained as the consequence of optimally solving the problem of perception under uncertainty (see also Kronrod et al., 2016).
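The logic behind the Clayards et al. (2008) prediction can be written out under standard assumptions (equal priors and equal-variance Gaussian VOT distributions for the two words; this is a textbook reconstruction, not the paper's own notation). Bayes' rule then yields a logistic identification function,

$$
P(\text{“peach”} \mid \mathrm{VOT} = x) \;=\; \left(1 + \exp\!\left[-\,\frac{\mu_{p} - \mu_{b}}{\sigma^{2}}\left(x - \frac{\mu_{p} + \mu_{b}}{2}\right)\right]\right)^{-1},
$$

whose slope is inversely proportional to the category variance $\sigma^{2}$. Widening the VOT distributions should therefore flatten listeners' identification curves, which is the optimal pattern Clayards et al. observed.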
Besides the acoustic cues explored in Clayards et al. (2008) and Feldman et al. (2009), there is extensive evidence that information from the visual modality, such as the speaker's facial features, also influences speech understanding (see Campbell, 2008, for a review). Bejjanki et al. (2011) offered a mathematical characterization of how probabilistic cues from speech and lip movements can be optimally combined. They showed that human performance during audio-visual phonemic labeling was consistent (at least at the qualitative level) with the predictions of an ideal observer. This previous research did not, however, study speech understanding when visual information was obtained through the referential context rather than through observation of the speaker's face. Although some experimental findings show that information about the identity of a referent can be integrated with linguistic information to resolve lexical and syntactic ambiguities in speech (e.g., Eberhard et al., 1995; Spivey et al., 2002; Tanenhaus et al., 1995), to our knowledge no study has offered an ideal observer analysis of this task (as we do here).
Combining information between words and visual referents might seem similar to audio-visual speech integration (e.g., Bejjanki et al., 2011), but there are at least two fundamental differences between these two cases, and both can influence the way the auditory and visual cues are combined.
First, in the case of audio-visual speech, both modalities offer information about the same underlying speech category. They differ only in terms of their informational reliability. In a referential context, however, the auditory and visual modalities play different roles in the referential process — the auditory input represents the symbol whereas the visual input represents the meaning (and these differences are in addition to possible differences in informational reliability). Speech is claimed to have a privileged status compared to other sensory stimuli (Edmiston and Lupyan, 2015; Lupyan and Thompson-Schill, 2012; Vouloumanos and Waxman, 2014; Waxman and Gelman, 2009; Waxman and Markow, 1995), and this privilege is suggested to be specifically related to the ability to refer (Waxman and Gelman, 2009). Thus, in a referential context, it is possible that listeners do not treat the auditory and visual modalities as equivalent sources of information. Instead, there could be a (potentially sub-optimal) bias for the auditory modality beyond what is expected from informational reliability alone.
Second, in the case of audio-visual speech, the auditory and visual stimuli are expected to be perceptually correlated. The expectation for this correlation is strong enough that when there is a mismatch between the auditory and visual input, they are still integrated into a unified (but illusory) percept (e.g., the McGurk effect; McGurk and MacDonald, 1976). In the case of referential language, however, the multimodal association is by nature arbitrary (Greenberg, 1957; Saussure, 1916). For instance, there is no logical or perceptual connection between the sound “bee” and the corresponding insect. Moreover, variation in the way the sound “bee” is pronounced is generally not expected to correlate perceptually with variation in the shape (or any other visual property) in the category of bees. In sum, cue combination in the case of arbitrary audio-visual associations (word-referent) is likely to be less automatic, more effortful, and therefore less conducive to optimal integration than it is in the case of perceptually correlated associations (as in audio-visual speech perception).
We investigate how cues from the auditory and the visual modality are combined to recognize novel words in a referential context. In particular, we study how this combination is performed under various degrees of uncertainty in both the auditory and the visual modality. Imagine, for example, that someone is uncertain whether they heard “pea” or “bee”. Does this uncertainty make them rely more on the referent (e.g., the object being pointed at)? Or, if they are not sure if they saw a bee or a fly, does this uncertainty make them rely more on the sound? More importantly, when input in both modalities is uncertain to varying degrees, do they weight each modality according to its relative reliability (the optimal strategy), or do they over-rely on a particular modality?
We begin by proposing an ideal observer model that performs the combination in an optimal fashion. We then compare the predictions of the optimal model to human responses. Humans can deviate from the ideal for several reasons. For instance, as mentioned above, sub-optimal behavior can be induced by the privileged status of a particular modality or by the arbitrariness of the referential association. In order to study possible patterns of sub-optimality, we compare the optimal normative model to a descriptive model (which is fit to actual responses). Comparing parameter estimates between these two formulations allows us to quantify the degree of deviation from optimality.
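As a sketch of how such a descriptive model can be fit, the code below assumes a log-odds (logistic) formulation in which each modality's evidence receives a free weight; weights of 1 recover the ideal observer, while a fitted auditory weight above 1 would indicate over-reliance on the auditory modality. The formulation, variable names, and simulated data are illustrative assumptions, not the paper's actual model code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic function

def predict(beta, logit_a, logit_v):
    """Choice probability from weighted per-modality log-odds evidence."""
    beta_a, beta_v = beta
    return expit(beta_a * logit_a + beta_v * logit_v)

def neg_log_lik(beta, logit_a, logit_v, choices):
    p = np.clip(predict(beta, logit_a, logit_v), 1e-9, 1 - 1e-9)
    return -np.sum(choices * np.log(p) + (1 - choices) * np.log(1 - p))

# Hypothetical data: per-trial log-odds of "word 1" from each modality alone,
# and binary choices from a simulated participant who over-weights audition.
rng = np.random.default_rng(0)
logit_a = rng.normal(0.0, 2.0, size=500)
logit_v = rng.normal(0.0, 2.0, size=500)
choices = rng.binomial(1, expit(1.5 * logit_a + 0.7 * logit_v))

fit = minimize(neg_log_lik, x0=[1.0, 1.0], args=(logit_a, logit_v, choices))
print(fit.x)  # fitted (beta_a, beta_v); the optimal model predicts (1, 1)
```

Comparing the fitted weights to the normative values quantifies both the direction and the magnitude of any deviation from optimality.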
We tested the ideal observer model's predictions in four behavioral experiments in which we varied the source of uncertainty. In Experiment 1, audio-visual tokens were ambiguous with respect to their category membership (in addition to carrying inherent sensory noise). In Experiment 2, we intervened by adding environmental noise that degraded information from the auditory modality, and in Experiment 3 we intervened by adding environmental noise that degraded information from the visual modality. Finally, in Experiment 4, we replicated Experiment 1 with a higher-powered design, allowing us to test cue combination at the individual level.
Paradigm and models
In this section, we first briefly introduce the multimodal combination task. Then we explain how behavior in this paradigm can be characterized optimally with an ideal observer model.
Experiment 1
In this experiment, we test the predictions of the model in the case where uncertainty is due to categorical variability (i.e., ambiguity in terms of category membership) and inherent sensory noise. We do not add any external noise to the background. Thus, we test the following (normative) cue weighting scheme.
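In standard Gaussian cue-combination terms (presented here as a reconstruction; the paper's exact notation may differ), the optimal scheme weights each modality by its relative reliability:

$$
\hat{s} = w_A\, x_A + w_V\, x_V,
\qquad
w_A = \frac{1/\sigma_A^{2}}{1/\sigma_A^{2} + 1/\sigma_V^{2}},
\qquad
w_V = \frac{1/\sigma_V^{2}}{1/\sigma_A^{2} + 1/\sigma_V^{2}},
$$

where $\sigma_A^{2}$ and $\sigma_V^{2}$ are the total variances of the auditory and visual cues, here reflecting category variability plus inherent sensory noise.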
Experiment 2
In this experiment, we explored the effect of added environmental noise on performance. We tested a case where background noise was added to the auditory modality. We were interested in whether participants would treat this new source of uncertainty as predicted by the optimal model, that is, according to the following weighting scheme.
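In the same reconstructed formulation, environmental noise of variance $\sigma_N^{2}$ inflates the auditory variance, so the optimal weight on the auditory cue shrinks accordingly:

$$
w_A = \frac{1/\left(\sigma_A^{2} + \sigma_N^{2}\right)}{1/\left(\sigma_A^{2} + \sigma_N^{2}\right) + 1/\sigma_V^{2}},
\qquad
w_V = 1 - w_A.
$$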
The alternative hypothesis is that noise in one modality leads to a systematic preference for the non-noisy modality.
Experiment 3
As in Experiment 2, we were interested in whether participants would treat the additional uncertainty as predicted by the optimal model, that is, according to the following weighting scheme.
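Symmetrically to Experiment 2 (again in the reconstructed formulation), the added noise now inflates the visual variance, shrinking the optimal weight on the visual cue:

$$
w_V = \frac{1/\left(\sigma_V^{2} + \sigma_N^{2}\right)}{1/\sigma_A^{2} + 1/\left(\sigma_V^{2} + \sigma_N^{2}\right)},
\qquad
w_A = 1 - w_V.
$$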
The alternative hypothesis is that noise in the visual modality would lead to a preference for the auditory input, just as noise in the auditory modality led to a preference for the visual input in Experiment 2.
Experiment 4
As we noted earlier, we did not have enough statistical power in Experiment 1 to fit a separate model for each participant. Thus, here we used a higher-powered design, allowing us to collect the number of data points necessary to model cue combination at the individual level.
General discussion
In the current paper, we explored word recognition under uncertainty about both words and their referents. We conducted an ideal observer analysis of this task whereby a model provided predictions about how information from each modality should be combined in an optimal fashion. The predictions of the model were tested in a series of four experiments in which instances of both the form and the meaning were ambiguous with respect to their category membership only (Experiments 1 and 4), and in which additional environmental noise degraded the auditory modality (Experiment 2) or the visual modality (Experiment 3).
Conclusions and future research directions
Our work used an ideal observer model to study word recognition under audio-visual uncertainty. This framework enabled us not only to test optimality but also to examine systematically how and by how much people deviate from optimality in their combination strategies. Thus, our work is part of a growing effort to go beyond optimality tests — which have limited explanatory power — and use models that also allow us to identify and explain various patterns of sub-optimality in human behavior.
Data availability
All data and code for these analyses are available at https://github.com/afourtassi/WordRec
Conflicts of interest
None of the authors has any financial interest or conflict of interest regarding this work or this submission.
Acknowledgements
This work was supported by a post-doctoral grant from the Fyssen Foundation.
References (80)
- Chater, N., & Manning, C. D. (2006). Probabilistic models of language processing and acquisition. Trends in Cognitive Sciences.
- Dingemanse, M., Blasi, D. E., Lupyan, G., Christiansen, M. H., & Monaghan, P. (2015). Arbitrariness, iconicity, and systematicity in language. Trends in Cognitive Sciences.
- Dupoux, E. (2018). Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition.
- Edmiston, P., & Lupyan, G. (2015). What makes words special? Words as unmotivated cues. Cognition.
- Havy, M., & Waxman, S. R. (2016). Naming influences 9-month-olds' identification of discrete categories along a perceptual continuum. Cognition.
- Knill, D. C., & Pouget, A. (2004). The Bayesian brain: The role of uncertainty in neural coding and computation. Trends in Neurosciences.
- Mattys, S. L., & Wiget, L. (2011). Effects of cognitive load on speech recognition. Journal of Memory and Language.
- Schwarz, N. (2004). Metacognitive experiences in consumer judgment and decision making. Journal of Consumer Psychology.
- Smith, L. B., & Yu, C. (2008). Infants rapidly learn word-referent mappings via cross-situational statistics. Cognition.
- Spivey, M. J., Tanenhaus, M. K., Eberhard, K. M., & Sedivy, J. C. (2002). Eye movements and spoken language comprehension: Effects of visual context on syntactic ambiguity resolution. Cognitive Psychology.