Speech Enhancement based on Denoising Autoencoder with Multi-branched Encoders
arXiv - CS - Sound. Pub Date: 2020-01-06, DOI: arxiv-2001.01538
Cheng Yu, Ryandhimas E. Zezario, Jonathan Sherman, Yi-Yen Hsieh, Xugang Lu, Hsin-Min Wang and Yu Tsao

Deep learning-based models have greatly advanced the performance of speech enhancement (SE) systems. However, two problems remain unsolved, both closely related to model generalizability to noisy conditions: (1) mismatched noisy conditions during testing, i.e., performance is generally sub-optimal when models are tested with unseen noise types that are not included in the training data; (2) local focus on specific noisy conditions, i.e., models trained using multiple types of noise cannot optimally remove a specific noise type even though that noise type was included in the training data. These problems are common in real applications. In this paper, we propose a novel denoising autoencoder with a multi-branched encoder (termed the DAEME model) to deal with these two problems. The DAEME model involves two stages: offline and online. In the offline stage, we build multiple component models to form a multi-branched encoder based on a dynamically-sized decision tree (DSDT). The DSDT is built based on prior knowledge of speech and noisy conditions (the speaker, environment, and signal factors are considered in this paper), where each component of the multi-branched encoder performs a particular mapping from noisy to clean speech along its branch in the DSDT. Finally, a decoder is trained on top of the multi-branched encoder. In the online stage, noisy speech is first processed by the tree and fed to each component model. The multiple outputs from these models are then integrated by the decoder to determine the final enhanced speech. Experimental results show that DAEME is superior to several baseline models in terms of objective evaluation metrics and subjective human listening tests.
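The abstract describes a two-stage pipeline: offline, component models are trained along the DSDT branches and a decoder is trained on top of them; online, noisy speech is fed to every branch and the decoder fuses the branch outputs into the final enhanced speech. The following PyTorch sketch illustrates only that online forward pass under stated assumptions: the branch count, layer sizes, and concatenation as the fusion step are illustrative choices, not the paper's exact configuration, and the DSDT construction itself is omitted.

# A minimal sketch of the DAEME online forward pass, based only on the
# abstract. Layer sizes, branch count, and fusion-by-concatenation are
# illustrative assumptions, not the paper's reported configuration.
import torch
import torch.nn as nn

class BranchEncoder(nn.Module):
    """One component model: maps noisy features toward clean speech for
    the noisy condition (speaker/environment/signal) of its DSDT branch."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class DAEME(nn.Module):
    """Multi-branched encoder plus decoder.

    Offline stage (assumed already done here): each branch is trained on
    the noisy/clean pairs of one DSDT node, then the decoder is trained
    on top of the multi-branched encoder.
    """
    def __init__(self, feat_dim: int, num_branches: int = 4):
        super().__init__()
        self.branches = nn.ModuleList(
            BranchEncoder(feat_dim) for _ in range(num_branches)
        )
        # Decoder integrates the stacked branch outputs into the final estimate.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim * num_branches, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        # Online stage: noisy speech is fed to every component model...
        outs = [branch(noisy) for branch in self.branches]
        # ...and the multiple outputs are integrated by the decoder.
        return self.decoder(torch.cat(outs, dim=-1))

# Usage on a batch of (frames, bins) spectral features, e.g. a 257-bin
# log-power STFT; the input shape and feature type are assumptions.
model = DAEME(feat_dim=257)
noisy = torch.randn(8, 100, 257)   # (batch, frames, bins), dummy data
enhanced = model(noisy)            # same shape as the input

In the paper, each branch handles the noisy-to-clean mapping of one decision-tree node; concatenation followed by a small feed-forward decoder is just one plausible way to realize the "outputs integrated into the decoder" step.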

Updated: 2020-01-14