Performance analysis of various training targets for improving speech quality and intelligibility,Applied Acoustics


Performance analysis of various training targets for improving speech quality and intelligibility
Applied Acoustics (IF 3.4), Pub Date: 2021-04-01, DOI: 10.1016/j.apacoust.2020.107817
Shoba Sivapatham, Asutosh Kar, Rajavel Ramadoss

Abstract: Denoising single-channel speech (recorded using one microphone) remains an open problem in many speech-related applications. Recently, supervised deep learning methods have been used to denoise the speech signal. This work uses a Deep Neural Network (DNN) to learn the Time–Frequency (T-F) mask of the clean speech from features of the noisy speech. In general, the Ideal Binary Mask (IBM) is used as the binary training target to improve speech intelligibility, and the Ideal Ratio Mask (IRM) is used as a non-binary training target to improve speech quality. Still, neither is necessarily the best T-F mask for improving speech quality and intelligibility, and an appropriate training target for supervised deep learning methods remains unclear. In this work, a novel non-binary soft T-F mask named Optimum Soft Mask (OSM) is proposed, analyzed, and compared with the different T-F mask types used in single-channel speech denoising methods. In addition, the target T-F mask is compared with existing state-of-the-art approaches to show a clear performance advantage of supervised deep learning models. The performance of the binary and non-binary DNN training targets in improving speech quality and intelligibility is evaluated under different Signal-to-Noise Ratios (SNRs) and noise conditions. The experimental results reveal that the binary mask IBM shows significant improvement in speech intelligibility, while the non-binary mask IRM shows a substantial improvement in speech quality. At the same time, the proposed novel soft T-F mask shows notable improvement in both quality and intelligibility under various test conditions.
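As a rough illustration of the two baseline training targets named in the abstract, the sketch below computes an IBM and an IRM from per-bin clean and noise power. It assumes the definitions commonly used in the speech separation literature (IBM: 1 where the local SNR exceeds a local criterion, else 0; IRM: the clean-to-mixture energy ratio raised to an exponent). The abstract does not state the paper's exact formulas, so the function names, the 0 dB criterion `lc_db`, and the exponent `beta=0.5` are illustrative assumptions. The OSM is not sketched because the abstract does not define it.

```python
import numpy as np


def ideal_binary_mask(clean_power, noise_power, lc_db=0.0, eps=1e-12):
    """Return 1.0 where the local SNR (in dB) exceeds lc_db, else 0.0.

    lc_db is an assumed local criterion; 0 dB is a common choice.
    """
    snr_db = 10.0 * np.log10((clean_power + eps) / (noise_power + eps))
    return (snr_db > lc_db).astype(np.float32)


def ideal_ratio_mask(clean_power, noise_power, beta=0.5, eps=1e-12):
    """Return a soft mask in [0, 1]: (S^2 / (S^2 + N^2)) ** beta.

    beta=0.5 is an assumed, commonly used exponent.
    """
    return (clean_power / (clean_power + noise_power + eps)) ** beta


# Toy 2x2 "spectrogram" of per-bin power (rows: frequency, cols: time).
clean_power = np.array([[4.0, 1.0], [0.25, 9.0]])
noise_power = np.ones_like(clean_power)

ibm = ideal_binary_mask(clean_power, noise_power)
irm = ideal_ratio_mask(clean_power, noise_power)

print(ibm)  # hard 0/1 decision per T-F bin
print(irm)  # graded value between 0 and 1 per T-F bin
```

In a supervised setup of the kind the abstract describes, a mask like these would serve as the regression or classification target that the DNN predicts from noisy speech features. The estimated mask is then applied to the noisy T-F representation before resynthesis.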


Updated: 2021-04-01
