Components loss for neural networks in mask-based speech enhancement
EURASIP Journal on Audio, Speech, and Music Processing (IF 1.7), Pub Date: 2021-07-02, DOI: 10.1186/s13636-021-00207-6
Ziyi Xu, Samy Elshamy, Ziyue Zhao, Tim Fingscheidt

Estimating time-frequency domain masks for single-channel speech enhancement using deep learning methods has recently become a popular research field with promising results. In this paper, we propose a novel components loss (CL) for the training of neural networks for mask-based speech enhancement. During training, the proposed CL offers separate control over preservation of the speech component quality, suppression of the noise component, and preservation of a natural-sounding residual noise component. We illustrate the potential of the proposed CL by evaluating a standard convolutional neural network (CNN) for mask-based speech enhancement. The new CL is compared to several baseline losses: the conventional mean squared error (MSE) loss w.r.t. speech spectral amplitudes or w.r.t. an ideal ratio mask; auditory-related loss functions such as the perceptual evaluation of speech quality (PESQ) loss and the perceptual weighting filter loss; and the recently proposed SNR loss with two masks. Detailed analysis suggests that the proposed CL obtains a better, or at least more balanced, performance across all employed instrumental quality metrics, including SNR improvement, speech component quality, and total enhanced speech quality, and in particular delivers a natural-sounding residual noise component. For unseen noise types, it even surpasses the perceptually motivated losses by roughly 0.2 PESQ points. The recently proposed SNR loss with two masks not only requires a network with more parameters due to its two decoder heads, but also falls behind on PESQ and POLQA, particularly w.r.t. residual noise quality. Notably, the proposed CL achieves significantly more first ranks across the evaluation metrics than any of the baseline losses. It is easy to implement, and code is provided at https://github.com/ifnspaml/Components-Loss .
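To make the idea concrete, the sketch below shows one way a three-term components loss of this kind could be assembled in PyTorch. It is a minimal illustration under assumptions, not the authors' implementation (see the linked repository for that): the function name components_loss, the default weights alpha and beta, and the power normalization used for the naturalness term are all illustrative choices. The key mechanism is that, during training, the estimated mask is applied separately to the known clean-speech and noise spectra, yielding a filtered speech component and a residual noise component that can each be penalized independently.

```python
import torch

def components_loss(mask, speech_spec, noise_spec, alpha=0.1, beta=0.1, eps=1e-8):
    """Illustrative sketch of a three-term components loss.

    mask:        estimated T-F mask, shape (batch, freq, frames)
    speech_spec: clean-speech spectral amplitudes (available during training)
    noise_spec:  noise spectral amplitudes (available during training)
    alpha, beta: assumed weights trading off noise suppression and
                 residual-noise naturalness against speech quality
    """
    # Apply the mask separately to the speech and noise components.
    s_tilde = mask * speech_spec   # filtered speech component
    d_tilde = mask * noise_spec    # filtered (residual) noise component

    # (1) Speech component quality: distortion of the filtered speech.
    j_speech = torch.mean((s_tilde - speech_spec) ** 2)

    # (2) Noise suppression: residual noise power.
    j_noise = torch.mean(d_tilde ** 2)

    # (3) Residual noise naturalness: match the *shape* of the residual
    #     noise spectrum to that of the original noise by comparing
    #     power-normalized spectra.
    def _normalize(x):
        norm = torch.sqrt(torch.sum(x ** 2, dim=(-2, -1), keepdim=True))
        return x / (norm + eps)

    j_natural = torch.mean((_normalize(d_tilde) - _normalize(noise_spec)) ** 2)

    # Weighted sum: separate knobs for each component objective.
    return (1.0 - alpha - beta) * j_speech + alpha * j_noise + beta * j_natural
```

Setting beta to zero in this sketch would recover a two-term variant that trades off only speech component distortion against residual noise power, which reflects the "separate control" idea described in the abstract.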

Last updated: 2021-07-02