A multi-objective learning speech enhancement algorithm based on IRM post-processing with joint estimation of SCNN and TCNN
Digital Signal Processing (IF 2.9), Pub Date: 2020-03-31, DOI: 10.1016/j.dsp.2020.102731
Ruwei Li, Xiaoyue Sun, Tao Li, Fengnian Zhao

In this study, a novel multi-objective speech enhancement algorithm is proposed. First, we construct a deep learning architecture based on a stacked and temporal convolutional neural network (STCNN). Second, the main log-power spectra (LPS) features are input into a stacked convolutional neural network (SCNN) to extract high-level abstract features. Third, an improved power function compression Mel-frequency cepstral coefficient (PC-MFCC) feature is proposed, which is more consistent with human hearing characteristics than the Mel-frequency cepstral coefficient (MFCC). Then, a temporal convolutional neural network (TCNN) takes PC-MFCC and the features learned by the SCNN as input, and separately predicts the clean LPS, the PC-MFCC and the ideal ratio mask (IRM). During training, PC-MFCC constrains the LPS and IRM through the loss function to obtain the optimal network structure. Finally, IRM-based post-processing is applied to the estimated clean LPS and IRM: it adjusts the weight between the two according to voice presence information to synthesise the enhanced speech. A series of experiments shows that PC-MFCC is effective and complementary to LPS in speech enhancement tasks. The proposed STCNN architecture, with its good feature extraction and sequence modelling capabilities, achieves higher speech enhancement performance than the comparison neural network models. Additionally, IRM-based post-processing further enhances the listening quality of the reconstructed speech. Compared with the baseline algorithm, the proposed multi-objective speech enhancement algorithm further improves the quality and intelligibility of the enhanced speech.
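The abstract does not give formulas, so the following is only a minimal sketch of the two spectral targets it names and of one possible way to combine them. The LPS and IRM functions follow their commonly used definitions, with the exponent beta = 0.5 as an assumed default. The function blend_estimates is hypothetical: it uses the estimated IRM itself as a per-bin speech-presence weight, which is an illustrative assumption and not the post-processing rule from the paper. All function and parameter names are invented for this example.

```python
import numpy as np


def log_power_spectrum(mag, eps=1e-12):
    """LPS: natural log of the squared magnitude spectrum (eps avoids log(0))."""
    return np.log(mag ** 2 + eps)


def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5, eps=1e-12):
    """Common IRM definition: (|S|^2 / (|S|^2 + |N|^2)) ** beta, values in [0, 1].

    beta = 0.5 is an assumed, widely used default, not a value from the paper.
    """
    s2 = clean_mag ** 2
    n2 = noise_mag ** 2
    return (s2 / (s2 + n2 + eps)) ** beta


def blend_estimates(est_lps, est_irm, noisy_mag):
    """Hypothetical post-processing: per-bin convex combination of two estimates.

    Assumption: the estimated IRM doubles as a speech-presence weight, so bins
    that look speech-dominated trust the LPS branch more and the remaining bins
    lean on the masked noisy spectrum. The paper's actual rule may differ.
    """
    mag_from_lps = np.exp(0.5 * est_lps)   # invert LPS back to a magnitude
    mag_from_irm = est_irm * noisy_mag     # apply the mask to the noisy magnitude
    weight = est_irm                       # speech-presence proxy in [0, 1]
    return weight * mag_from_lps + (1.0 - weight) * mag_from_irm
```

In a full system the enhanced magnitude would be combined with the noisy phase and passed through an inverse STFT; that step is omitted here because the abstract does not describe it.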




Updated: 2020-04-01