Vision perceptually restores auditory spectral dynamics in speech.
Proceedings of the National Academy of Sciences of the United States of America (IF 9.4). Pub Date: 2020-07-21. DOI: 10.1073/pnas.2002887117
John Plass 1,2, David Brang 3, Satoru Suzuki 2,4, Marcia Grabowecky 2,4

Visual speech facilitates auditory speech perception, but the visual cues responsible for these benefits and the information they provide remain unclear. Low-level models emphasize basic temporal cues provided by mouth movements, but these impoverished signals may not fully account for the richness of auditory information provided by visual speech. High-level models posit interactions among abstract categorical (i.e., phonemes/visemes) or amodal (e.g., articulatory) speech representations, but require lossy remapping of speech signals onto abstracted representations. Because visible articulators shape the spectral content of speech, we hypothesized that the perceptual system might exploit natural correlations between midlevel visual (oral deformations) and auditory speech features (frequency modulations) to extract detailed spectrotemporal information from visual speech without employing high-level abstractions. Consistent with this hypothesis, we found that the time–frequency dynamics of oral resonances (formants) could be predicted with unexpectedly high precision from the changing shape of the mouth during speech. When isolated from other speech cues, speech-based shape deformations improved perceptual sensitivity for corresponding frequency modulations, suggesting that listeners could exploit this cross-modal correspondence to facilitate perception. To test whether this type of correspondence could improve speech comprehension, we selectively degraded the spectral or temporal dimensions of auditory sentence spectrograms to assess how well visual speech facilitated comprehension under each degradation condition. Visual speech produced drastically larger enhancements during spectral degradation, suggesting a condition-specific facilitation effect driven by cross-modal recovery of auditory speech spectra. The perceptual system may therefore use audiovisual correlations rooted in oral acoustics to extract detailed spectrotemporal information from visual speech.
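The spectral-versus-temporal degradation manipulation described above can be made concrete with a short sketch. The abstract does not specify the authors' signal-processing pipeline, so the Python example below is only an illustration of the general idea under assumed parameters: blur the magnitude spectrogram along the frequency axis to remove spectral (formant) detail, or along the time axis to remove temporal (envelope) detail. The function name `degrade` and the blur settings are hypothetical, not taken from the study.

```python
# Minimal sketch (not the authors' actual method) of spectral vs. temporal
# degradation of speech: smear the magnitude spectrogram along one axis,
# then resynthesize with the original phase.
import numpy as np
from scipy.signal import stft, istft
from scipy.ndimage import gaussian_filter1d

def degrade(audio, fs, axis="spectral", smear=8.0):
    """Return audio with spectral or temporal detail smoothed away.

    axis="spectral": blur across frequency bins (flattens formant structure).
    axis="temporal": blur across time frames (flattens envelope dynamics).
    `smear` is the Gaussian blur width in bins/frames (illustrative value).
    """
    f, t, Z = stft(audio, fs=fs, nperseg=512)
    mag, phase = np.abs(Z), np.angle(Z)
    blur_axis = 0 if axis == "spectral" else 1  # 0 = frequency, 1 = time
    mag = gaussian_filter1d(mag, sigma=smear, axis=blur_axis)
    _, degraded = istft(mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return degraded
```

Under this sketch, the paper's finding corresponds to visual speech helping most in the axis="spectral" condition, where the blur removes exactly the formant detail that oral shape deformations are correlated with.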



Updated: 2020-07-22