Multi-objective long-short term memory recurrent neural networks for speech enhancement
Journal of Ambient Intelligence and Humanized Computing, Pub Date: 2020-10-16, DOI: 10.1007/s12652-020-02598-4
Nasir Saleem, Muhammad Irfan Khattak, Mu’ath Al-Hasan, Atif Jan

Speech-in-noise perception is an important research problem in many real-world multimedia applications. Noise-reduction methods have contributed significantly; however, they rely on a priori information about the noise signals. Deep learning approaches have been developed to enhance speech signals in nonstationary noisy backgrounds, and their benefits have been evaluated in terms of perceived speech quality and intelligibility. In this paper, a multi-objective speech enhancement method based on the Long Short-Term Memory (LSTM) recurrent neural network (RNN) is proposed to simultaneously estimate the magnitude and phase spectra of clean speech. During training, the noisy phase spectrum is incorporated as a target, and the unstructured phase spectrum is transformed to its derivative, which has a structure identical to that of the corresponding magnitude spectrum. Critical Band Importance Functions (CBIFs) are used in the training process to further improve network performance. The results show that the proposed multi-objective LSTM (MO-LSTM) outperformed the standard magnitude-aware LSTM (MA-LSTM), magnitude-aware DNN (MA-DNN), phase-aware DNN (PA-DNN), magnitude-aware GNN (MA-GNN) and magnitude-aware CNN (MA-CNN). Moreover, the proposed speech enhancement considerably improved speech quality, intelligibility, noise reduction and automatic speech recognition in changing noisy backgrounds, as confirmed by an analysis of variance (ANOVA).
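The idea that a phase derivative shares the structure of the magnitude spectrum can be illustrated with a small sketch. The abstract does not state which derivative is used or which STFT settings apply, so everything below is an assumption: a Hann-windowed STFT, a frame length of 256, a hop of 128, and a frame-to-frame (time) phase difference with the expected per-bin phase advance removed. The function names `stft` and `magnitude_and_phase_derivative` are hypothetical and are not taken from the paper.

```python
import numpy as np

def stft(signal, frame_len=256, hop=128):
    """Hann-windowed short-time Fourier transform, shape (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def magnitude_and_phase_derivative(signal, frame_len=256, hop=128):
    """Return the magnitude spectrum and a time derivative of the phase.

    Assumption (not from the paper): the derivative is the frame-to-frame
    phase difference minus the phase advance expected at each bin centre,
    wrapped back into (-pi, pi].
    """
    spec = stft(signal, frame_len, hop)
    magnitude = np.abs(spec)
    phase = np.angle(spec)

    bins = np.arange(spec.shape[1])
    expected_advance = 2 * np.pi * hop * bins / frame_len

    delta = np.zeros_like(phase)  # first frame has no predecessor
    delta[1:] = phase[1:] - phase[:-1] - expected_advance
    delta = np.angle(np.exp(1j * delta))  # wrap to the principal value
    return magnitude, delta

# Usage: a pure tone centred on bin 16 of a 256-point frame.
n = np.arange(2048)
tone = np.cos(2 * np.pi * 16 / 256 * n)
mag, dphi = magnitude_and_phase_derivative(tone)
print(mag.shape, dphi.shape)  # both arrays share one (frames, bins) layout
```

Because both outputs share the same time-frequency layout, a network could in principle stack them as joint training targets. How MO-LSTM actually combines them, and how the CBIF weighting enters training, is not specified in the abstract.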




Updated: 2020-10-17