Deep Attention Fusion Feature for Speech Separation with End-to-End Post-filter Method
arXiv - CS - Sound Pub Date : 2020-03-17 , DOI: arxiv-2003.07544
Cunhang Fan and Jianhua Tao and Bin Liu and Jiangyan Yi and Zhengqi Wen and Xuefei Liu

In this paper, we propose an end-to-end post-filter method with deep attention fusion features for monaural speaker-independent speech separation. First, a time-frequency domain speech separation method is applied as the pre-separation stage, whose aim is to separate the mixture preliminarily. Although this stage can separate the mixture, the separated speech still contains residual interference. To enhance the pre-separated speech and further improve separation performance, we propose the end-to-end post-filter (E2EPF) with deep attention fusion features. The E2EPF makes full use of the prior knowledge in the pre-separated speech, which contributes to speech separation. It is a fully convolutional speech separation network that takes the waveform as its input feature. Firstly, a 1-D convolutional layer extracts deep representation features from the mixture and pre-separated signals in the time domain. Secondly, to pay more attention to the outputs of the pre-separation stage, an attention module acquires deep attention fusion features, which are extracted by computing the similarity between the mixture and the pre-separated speech. These deep attention fusion features help reduce the interference and enhance the pre-separated speech. Finally, these features are fed to the post-filter to estimate each target signal. Experimental results on the WSJ0-2mix dataset show that the proposed method outperforms state-of-the-art speech separation methods. Compared with the pre-separation method, our proposed method achieves relative improvements of 64.1%, 60.2%, 25.6%, and 7.5% in scale-invariant source-to-noise ratio (SI-SNR), signal-to-distortion ratio (SDR), perceptual evaluation of speech quality (PESQ), and short-time objective intelligibility (STOI), respectively.
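The two computational ideas named in the abstract — a similarity-based attention fusion of mixture and pre-separated features, and SI-SNR as the evaluation metric — can be sketched roughly as follows. This is an illustrative pure-Python sketch, not the paper's exact architecture: the function names, the dot-product similarity, and the concatenation-style fusion are assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fusion(mixture_feats, presep_feats):
    """Hypothetical fusion: for each mixture frame, score every pre-separated
    frame by dot-product similarity, softmax the scores, take the weighted sum
    of pre-separated frames as the attention feature, and concatenate it with
    the mixture frame to form the fused feature."""
    fused = []
    for m in mixture_feats:
        scores = [sum(a * b for a, b in zip(m, p)) for p in presep_feats]
        weights = softmax(scores)
        attn = [sum(w * p[d] for w, p in zip(weights, presep_feats))
                for d in range(len(m))]
        fused.append(m + attn)  # concatenation is one possible fusion choice
    return fused

def si_snr(est, ref):
    """Scale-invariant source-to-noise ratio in dB: project the estimate onto
    the reference, then compare target energy against residual-noise energy."""
    scale = sum(e * r for e, r in zip(est, ref)) / sum(r * r for r in ref)
    s_target = [scale * r for r in ref]
    e_noise = [e - t for e, t in zip(est, s_target)]
    return 10.0 * math.log10(sum(t * t for t in s_target) /
                             sum(n * n for n in e_noise))
```

For instance, `si_snr([1.1, 1.9, 3.0], [1.0, 2.0, 3.0])` comes out around 28.5 dB; a larger value means the estimate is closer to the reference up to a scaling factor.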

Updated: 2020-03-18