On Training Targets for Supervised Speech Separation


IEEE/ACM Transactions on Audio, Speech, and Language Processing ( IF 5.4 ) Pub Date : 2015-01-20 , DOI: 10.1109/taslp.2014.2352935
Yuxuan Wang, Arun Narayanan, DeLiang Wang


Formulation of speech separation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised learning algorithm, typically a deep neural network, is trained to learn a mapping from noisy features to a time-frequency representation of the target of interest. Traditionally, the ideal binary mask (IBM) is used as the target because of its simplicity and large speech intelligibility gains. The supervised learning framework, however, is not restricted to the use of binary targets. In this study, we evaluate and compare separation results by using different training targets, including the IBM, the target binary mask, the ideal ratio mask (IRM), the short-time Fourier transform spectral magnitude and its corresponding mask (FFT-MASK), and the Gammatone frequency power spectrum. Our results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics. In addition, we find that masking based targets, in general, are significantly better than spectral envelope based targets. We also present comparisons with recent methods in non-negative matrix factorization and speech enhancement, which show clear performance advantages of supervised speech separation.
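To make the difference between a binary and a ratio target concrete, here is a minimal NumPy sketch of how an IBM and an IRM could be computed from premixed speech and noise power in each time-frequency unit. The abstract gives no formulas, so the definitions below are assumptions based on common usage: the IBM thresholds the local SNR against a local criterion (`lc_db`, set to 0 dB here), and the IRM is the speech-to-mixture power ratio raised to an exponent (`beta`, set to 0.5 here). The function names and the toy values are hypothetical.

```python
import numpy as np

EPS = np.finfo(float).eps  # guards against log(0) and division by zero


def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Assumed IBM: 1 where the local SNR in dB exceeds lc_db, else 0."""
    snr_db = 10.0 * np.log10((speech_power + EPS) / (noise_power + EPS))
    return (snr_db > lc_db).astype(float)


def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """Assumed IRM: (speech / (speech + noise)) ** beta, a value in [0, 1]."""
    return (speech_power / (speech_power + noise_power + EPS)) ** beta


# Toy 2x2 "spectrogram" of per-unit powers (rows: frequency, cols: time).
speech_power = np.array([[4.0, 1.0], [0.0, 9.0]])
noise_power = np.array([[1.0, 4.0], [1.0, 9.0]])

ibm = ideal_binary_mask(speech_power, noise_power)
irm = ideal_ratio_mask(speech_power, noise_power)
print(ibm)
print(irm)
```

In this sketch the IBM makes a hard keep-or-discard decision per unit, while the IRM assigns a soft weight, so a unit with equal speech and noise power is zeroed by the binary mask but only attenuated by the ratio mask.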


Updated: 2019-11-01
