Evaluating the Potential Gain of Auditory and Audiovisual Speech-Predictive Coding Using Deep Learning
Neural Computation (IF 2.7), Pub Date: 2020-03-01, DOI: 10.1162/neco_a_01264
Thomas Hueber, Eric Tatulli, Laurent Girin, Jean-Luc Schwartz

Sensory processing is increasingly conceived in a predictive framework in which neurons would constantly process the error signal resulting from the comparison of expected and observed stimuli. Surprisingly, few data exist on the accuracy of predictions that can be computed in real sensory scenes. Here, we focus on the sensory processing of auditory and audiovisual speech. We propose a set of computational models based on artificial neural networks (mixing deep feedforward and convolutional networks), which are trained to predict future audio observations from present and past audio or audiovisual observations (i.e., including lip movements). These predictions exploit purely local phonetic regularities with no explicit call to higher linguistic levels. Experiments are conducted on the multispeaker LibriSpeech audio speech database (around 100 hours) and on the NTCD-TIMIT audiovisual speech database (around 7 hours). The predictions appear to be efficient in a short temporal range (25–50 ms), capturing 50% to 75% of the variance of the incoming stimulus, which could result in potentially saving up to three-quarters of the processing power. They then quickly decrease and almost vanish after 250 ms. Adding information on the lips slightly improves predictions, with a 5% to 10% increase in explained variance. Interestingly, the visual gain vanishes more slowly, and the gain is maximum for a delay of 75 ms between image and predicted sound.
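The core idea can be illustrated with a minimal sketch (not the authors' exact architecture or features): a feedforward network is trained to predict a future audio feature frame from a short window of past frames, and the prediction is scored by the fraction of the target's variance it explains. The feature dimension, context window, prediction delay, and layer sizes below are illustrative assumptions, not values from the paper; the audiovisual variant would simply append lip-feature frames to the input window.

```python
import torch
import torch.nn as nn

FEAT_DIM = 40   # e.g., log filterbank coefficients per frame (assumed)
CONTEXT = 10    # number of past frames used as input (assumed)
DELAY = 5       # prediction horizon in frames (e.g., 5 x 10 ms = 50 ms)

class FramePredictor(nn.Module):
    """Predict the audio frame at t + DELAY from frames t-CONTEXT+1 ... t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(CONTEXT * FEAT_DIM, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, FEAT_DIM),
        )

    def forward(self, past):      # past: (batch, CONTEXT, FEAT_DIM)
        return self.net(past)     # -> (batch, FEAT_DIM)

def explained_variance(pred, target):
    """Fraction of the target frame's variance captured by the prediction."""
    resid = target - pred
    return 1.0 - resid.var() / target.var()

# Toy training loop on random tensors standing in for real speech features.
model = FramePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

past = torch.randn(64, CONTEXT, FEAT_DIM)   # placeholder past frames
future = torch.randn(64, FEAT_DIM)          # placeholder frames at t + DELAY

for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(past), future)
    loss.backward()
    opt.step()

print("explained variance:", explained_variance(model(past), future).item())
```

Sweeping DELAY over a range of horizons (25 ms to several hundred ms) and plotting the explained variance against the delay would reproduce the kind of analysis the abstract describes, where predictive power is high at short horizons and decays toward zero by about 250 ms.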

Last updated: 2020-03-01