Multi-scale decomposition based supervised single channel deep speech enhancement
Applied Soft Computing ( IF 8.7 ) Pub Date : 2020-08-31 , DOI: 10.1016/j.asoc.2020.106666
Nasir Saleem , Muhammad Irfan Khattak

Speech signals reaching our ears are generally contaminated by background noise, which degrades both speech quality and intelligibility. In this paper, we propose a nonlinear multi-scale decomposition-based deep speech enhancement method to improve the quality and intelligibility of contaminated speech. In the proposed method, Hurst exponent-based Empirical Mode Decomposition (HEMD) is applied to the noisy speech to obtain a set of intrinsic mode functions (IMFs) and a residual. A Deep Neural Network (DNN) is trained for each extracted IMF and for the residual to learn a nonlinear mapping, through a deep hidden structure, that constructs a time-frequency mask. We formulate three deep speech enhancement structures based on three time-frequency masks: the Ideal Ratio Mask (IRM), the Ideal Binary Mask (IBM), and the Phase Sensitive Mask (PSM). Background noise also distorts the original phase of the clean speech and therefore introduces perceptual disturbances that harm speech quality and intelligibility. To avoid this degradation, an iterative procedure is adopted to compensate the phase in noisy conditions. A nonlinear Mel-scale weighted MSE (L_MW-MSE) is used as the loss function during network training, so that the gradients are computed on a perceptually motivated nonlinear frequency scale. The output features of conventional deep neural networks are usually over-smoothed, which deteriorates speech quality. To alleviate this over-smoothing, frequency-independent spectral variance equalization is applied as a post-filter. The performance of the proposed deep enhancement methods is extensively evaluated in various adverse noisy environments and compared with DNNs built on the same time-frequency masks. The results demonstrate that the proposed deep speech enhancement performs better in terms of perceived speech quality and intelligibility.
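
The three time-frequency masks named in the abstract can be illustrated with their commonly used textbook definitions. The sketch below is not taken from the paper: the function names, the IRM exponent `beta`, the IBM threshold `lc_db`, and the clipping of the PSM to [0, 1] are all assumptions made for illustration, and the authors' exact formulations may differ. Each function works on a single time-frequency bin, where `s`, `n`, and `y = s + n` are the complex STFT coefficients of the clean speech, the noise, and the noisy mixture.

```python
import math


def ideal_ratio_mask(s: complex, n: complex, beta: float = 0.5) -> float:
    """IRM for one bin: (|S|^2 / (|S|^2 + |N|^2)) ** beta (beta is an assumed default)."""
    ps, pn = abs(s) ** 2, abs(n) ** 2
    if ps + pn == 0.0:
        return 0.0  # empty bin: nothing to keep
    return (ps / (ps + pn)) ** beta


def ideal_binary_mask(s: complex, n: complex, lc_db: float = 0.0) -> float:
    """IBM for one bin: 1 if the local SNR in dB exceeds lc_db, else 0."""
    ps, pn = abs(s) ** 2, abs(n) ** 2
    if ps == 0.0:
        return 0.0  # no speech energy in this bin
    if pn == 0.0:
        return 1.0  # no noise energy in this bin
    return 1.0 if 10.0 * math.log10(ps / pn) > lc_db else 0.0


def phase_sensitive_mask(s: complex, y: complex, clip: bool = True) -> float:
    """PSM for one bin: |S|/|Y| * cos(theta_S - theta_Y), which equals Re(S / Y)."""
    if y == 0:
        return 0.0
    psm = (s / y).real
    # Clipping to [0, 1] is a common practical choice, assumed here.
    return min(max(psm, 0.0), 1.0) if clip else psm


if __name__ == "__main__":
    s, n = 3 + 0j, 0 + 4j  # toy clean-speech and noise coefficients
    y = s + n              # noisy mixture coefficient
    print(ideal_ratio_mask(s, n), ideal_binary_mask(s, n), phase_sensitive_mask(s, y))
```

For the toy bin above, the noise is stronger than the speech, so the IBM discards the bin, while the IRM and PSM keep a fraction of it (0.6 and about 0.36 respectively). The PSM is smaller than the IRM here because it also penalizes the phase difference between the clean and noisy coefficients.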




Updated: 2020-08-31