A Deep Ensemble Learning Method for Monaural Speech Separation.
IEEE/ACM Transactions on Audio, Speech, and Language Processing (IF 5.4). Pub Date: 2016-12-06. DOI: 10.1109/taslp.2016.2536478
Xiao-Lei Zhang, DeLiang Wang

Monaural speech separation is a fundamental problem in robust speech processing. Recently, deep neural network (DNN)-based speech separation methods, which predict either clean speech or an ideal time-frequency mask, have demonstrated remarkable performance improvement. However, a single DNN with a given window length does not leverage contextual information sufficiently, and the differences between the two optimization objectives are not well understood. In this paper, we propose a deep ensemble method, named multicontext networks, to address monaural speech separation. The first multicontext network averages the outputs of multiple DNNs whose inputs employ different window lengths. The second multicontext network is a stack of multiple DNNs. Each DNN in a module of the stack takes the concatenation of original acoustic features and expansion of the soft output of the lower module as its input, and predicts the ratio mask of the target speaker; the DNNs in the same module employ different contexts. We have conducted extensive experiments with three speech corpora. The results demonstrate the effectiveness of the proposed method. We have also compared the two optimization objectives systematically and found that predicting the ideal time-frequency mask is more efficient in utilizing clean training speech, while predicting clean speech is less sensitive to SNR variations.
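
The two architectures described above lend themselves to a compact sketch. The NumPy code below is an illustrative reconstruction based only on this abstract: the helper names (make_context, RatioMaskDNN, mc_average, mc_stack), the untrained random-weight MLP standing in for trained networks, and the chosen window sizes are all assumptions, not the authors' implementation, which trains each DNN to predict the ideal ratio mask of the target speaker.

```python
# Illustrative sketch of the two multicontext ensembles described in the abstract.
# All names and the tiny random-weight MLP are placeholders, not the authors' code.
import numpy as np

rng = np.random.default_rng(0)


def make_context(feats, half_window):
    """Concatenate each frame with +/- half_window neighbouring frames
    (edges are padded by repeating the boundary frames)."""
    T, D = feats.shape
    padded = np.pad(feats, ((half_window, half_window), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[t:t + T] for t in range(2 * half_window + 1)], axis=1
    )


class RatioMaskDNN:
    """Stand-in for a trained DNN: one hidden layer, sigmoid output in [0, 1]."""

    def __init__(self, in_dim, out_dim, hidden=64):
        self.W1 = rng.normal(scale=0.1, size=(in_dim, hidden))
        self.W2 = rng.normal(scale=0.1, size=(hidden, out_dim))

    def predict(self, x):
        h = np.maximum(x @ self.W1, 0.0)              # ReLU hidden layer
        return 1.0 / (1.0 + np.exp(-(h @ self.W2)))   # soft ratio mask


def mc_average(feats, half_windows, mask_dim):
    """Multicontext averaging: DNNs with different window lengths, outputs averaged."""
    masks = []
    for w in half_windows:
        x = make_context(feats, w)
        dnn = RatioMaskDNN(x.shape[1], mask_dim)
        masks.append(dnn.predict(x))
    return np.mean(masks, axis=0)


def mc_stack(feats, half_windows_per_module, mask_dim):
    """Multicontext stacking: each module's DNNs see the raw features plus the
    context-expanded soft mask produced by the module below."""
    prev_mask = None
    for half_windows in half_windows_per_module:
        module_masks = []
        for w in half_windows:
            x = make_context(feats, w)
            if prev_mask is not None:
                x = np.concatenate([x, make_context(prev_mask, w)], axis=1)
            dnn = RatioMaskDNN(x.shape[1], mask_dim)
            module_masks.append(dnn.predict(x))
        prev_mask = np.mean(module_masks, axis=0)     # soft output passed upward
    return prev_mask


if __name__ == "__main__":
    noisy_feats = rng.normal(size=(100, 40))          # 100 frames, 40-dim features
    print(mc_average(noisy_feats, half_windows=[1, 2, 3], mask_dim=40).shape)
    print(mc_stack(noisy_feats, [[1, 2], [1, 2]], mask_dim=40).shape)
```

The point the stacked variant tries to capture is that every DNN in a module receives both the raw acoustic features and the context-expanded soft output of the module below, while the DNNs within one module differ only in window length, so higher modules can refine earlier mask estimates using varying amounts of context.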

Updated: 2019-11-01