Learning spectro-temporal representations of complex sounds with parameterized neural networks, The Journal of the Acoustical Society of America


Learning spectro-temporal representations of complex sounds with parameterized neural networks
The Journal of the Acoustical Society of America (IF 2.4), Pub Date: 2021-07-14, DOI: 10.1121/10.0005482
Rachid Riad₁, Julien Karadayi₁, Anne-Catherine Bachoud-Lévi₂, Emmanuel Dupoux₁

Deep learning models have become potential candidates for auditory neuroscience research thanks to their recent successes in a variety of auditory tasks, yet these models often lack the interpretability needed to fully understand the exact computations they perform. Here, we proposed a parameterized neural network layer that computes specific spectro-temporal modulations based on Gabor filters [learnable spectro-temporal filters (STRFs)] and is fully interpretable. We evaluated this layer on speech activity detection, speaker verification, urban sound classification, and zebra finch call type classification. We found that models based on learnable STRFs are on par with the state of the art on all tasks and obtain the best performance for speech activity detection. Because the layer remains a Gabor filter, we could use quantitative measures to describe the distribution of the learned spectro-temporal modulations. The filters adapted to each task and focused mostly on low temporal and spectral modulations. The analyses show that the filters learned on human speech have spectro-temporal parameters similar to those measured directly in the human auditory cortex. Finally, we observed that the tasks were organized in a meaningful way: the human vocalization tasks were close to each other, and the bird vocalization task was far from both the human vocalization and urban sound tasks.
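To make the idea of a Gabor-based spectro-temporal filter concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the layer's exact parameterization, so the real-valued Gabor form, the function names `gabor_strf` and `apply_strf`, and the parameters `sigma_t`, `sigma_f`, `omega_t`, and `omega_f` are all assumptions made for illustration. In a learnable layer these few parameters would presumably be the trained quantities; here they are fixed.

```python
import numpy as np


def gabor_strf(n_time=9, n_freq=9, sigma_t=2.0, sigma_f=2.0,
               omega_t=0.5, omega_f=0.5):
    """Build a real-valued 2D Gabor kernel of shape (n_time, n_freq).

    Hypothetical parameterization (not taken from the paper):
    sigma_t, sigma_f set the Gaussian envelope widths, and
    omega_t, omega_f set the temporal and spectral modulation
    frequencies in radians per bin.
    """
    # Coordinates centered on the middle of the kernel.
    t = np.arange(n_time) - (n_time - 1) / 2
    f = np.arange(n_freq) - (n_freq - 1) / 2
    tt, ff = np.meshgrid(t, f, indexing="ij")
    # Gaussian envelope localizes the filter in time and frequency.
    envelope = np.exp(-(tt ** 2) / (2 * sigma_t ** 2)
                      - (ff ** 2) / (2 * sigma_f ** 2))
    # Sinusoidal carrier selects one spectro-temporal modulation.
    carrier = np.cos(omega_t * tt + omega_f * ff)
    return envelope * carrier


def apply_strf(spectrogram, kernel):
    """Valid-mode 2D cross-correlation of a (time, freq) array with kernel."""
    windows = np.lib.stride_tricks.sliding_window_view(
        spectrogram, kernel.shape)
    # windows has shape (T - kt + 1, F - kf + 1, kt, kf).
    return np.einsum("ijkl,kl->ij", windows, kernel)


# Usage on a random stand-in for a (time, frequency) spectrogram.
rng = np.random.default_rng(0)
spec = rng.standard_normal((20, 16))
response = apply_strf(spec, gabor_strf())
print(response.shape)  # (12, 8)
```

Because each filter in this sketch is fully described by four numbers, its preferred temporal and spectral modulations can be read directly from `omega_t` and `omega_f`, which is the kind of interpretability the abstract refers to.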

Updated: 2021-07-14
