Glimpse-based estimation of speech intelligibility from speech-in-noise using artificial neural networks
Computer Speech & Language (IF 4.3) Pub Date: 2021-03-10, DOI: 10.1016/j.csl.2021.101220
Yan Tang

While human listeners can, to some extent, understand the information conveyed by the speech signal when it is mixed with noise, traditional objective intelligibility measures usually fail to operate without a priori knowledge of the clean speech signal. This limits the usability of those measures in situations where the clean speech signal is inaccessible. In this paper a glimpse-based method is extended to make speech intelligibility predictions directly from speech-plus-noise mixtures. Using a neural network, the proposed method estimates the time-frequency regions with a local speech-to-noise ratio above a given threshold – known as glimpses – from the mixture signal, instead of separately comparing the speech signal against the noise signal. The number and locations of the glimpses can then be used to produce an intelligibility score. In Experiment I, where listener intelligibility was measured in one stationary and nine fluctuating noise maskers, the predictions produced by the proposed method were highly correlated with the subjective data, with correlation coefficients above 0.90. In Experiment II, with the same neural network trained on normal natural speech as in Experiment I, the proposed method was used to predict the intelligibility of speech signals modified by intelligibility-enhancement algorithms, as well as of synthetic speech. The method maintained its predictive power, performing similarly to its intrusive counterpart with an overall correlation coefficient of 0.81, superior to many traditional measures evaluated under the same conditions. Therefore, the proposed method can be used to estimate speech intelligibility in place of traditional measures in conditions where their capacity falls short.
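To make the glimpse notion concrete, the sketch below illustrates the intrusive form of the computation described above: given separate spectrogram-like analyses of the speech and the noise, it marks every time-frequency cell whose local SNR exceeds a threshold as a glimpse and returns the glimpse proportion. The 3 dB threshold, the plain power-spectrogram representation, and the function name `glimpse_proportion` are illustrative assumptions, not the paper's exact configuration; the proposed method instead uses a neural network to estimate this glimpse mask directly from the mixture, without access to the separate signals.

```python
import numpy as np

def glimpse_proportion(speech_spec, noise_spec, snr_threshold_db=3.0):
    """Fraction of time-frequency cells whose local SNR exceeds a threshold.

    speech_spec, noise_spec: magnitude spectrograms of the same shape,
    e.g. from a gammatone or STFT analysis of the separate signals.
    The 3 dB threshold is illustrative, not the paper's exact setting.
    """
    eps = 1e-12  # avoid log of zero in silent cells
    local_snr_db = 10.0 * np.log10((speech_spec ** 2 + eps) /
                                   (noise_spec ** 2 + eps))
    glimpses = local_snr_db > snr_threshold_db  # boolean glimpse mask
    return glimpses.mean(), glimpses

# Toy usage with random arrays standing in for real spectrograms.
rng = np.random.default_rng(0)
speech = np.abs(rng.normal(size=(64, 200)))  # 64 bands x 200 frames
noise = np.abs(rng.normal(size=(64, 200)))
gp, mask = glimpse_proportion(speech, noise)
print(f"glimpse proportion: {gp:.3f}")
```

In the non-intrusive setting studied in the paper, the boolean mask above is what the network is trained to approximate from the mixture alone, and the resulting glimpse count and locations are then mapped to an intelligibility score.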




Updated: 2021-03-21