A CNN-based approach to identification of degradations in speech signals
EURASIP Journal on Audio, Speech, and Music Processing (IF 2.4), Pub Date: 2021-02-05, DOI: 10.1186/s13636-021-00198-4
Yuki Saishu, Amir Hossein Poorjam, Mads Græsbøll Christensen

The presence of degradations in speech signals, which causes acoustic mismatch between training and operating conditions, deteriorates the performance of many speech-based systems. A variety of enhancement techniques have been developed to compensate for the acoustic mismatch in speech-based applications. To apply these signal enhancement techniques, however, it is necessary to have prior information about the presence and the type of degradations in speech signals. In this paper, we propose a new convolutional neural network (CNN)-based approach to automatically identify the major types of degradations commonly encountered in speech-based applications, namely additive noise, nonlinear distortion, and reverberation. In this approach, a set of parallel CNNs, each detecting a certain degradation type, is applied to the log-mel spectrogram of audio signals. Experimental results using two different speech types, namely pathological voice and normal running speech, show that the proposed method is effective in detecting the presence and the type of degradations in speech signals and that it outperforms the state-of-the-art method. Using score-weighted class activation mapping, we provide a visual analysis of how the network makes decisions when identifying different types of degradation in speech signals, highlighting the regions of the log-mel spectrogram that are most influential for the target degradation.
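
The parallel-detector structure described in the abstract can be sketched roughly as follows. This is only an illustration of the decision layout (one independent binary detector per degradation type, all reading the same log-mel input), not the authors' implementation. The names `identify_degradations` and `make_stub_detector`, the 0.5 threshold, and the stub scoring rule are all assumptions made for this sketch; in the paper each detector is a trained CNN.

```python
import math
from typing import Callable, Dict, List, Tuple

# A log-mel spectrogram as nested lists: log_mel[band][frame].
Spectrogram = List[List[float]]
Detector = Callable[[Spectrogram], float]


def sigmoid(x: float) -> float:
    """Squash a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def make_stub_detector(bias: float) -> Detector:
    """Hypothetical stand-in for one trained CNN.

    It maps the mean log-mel value plus a bias to a score in (0, 1).
    This rule has no acoustic meaning; it only gives the sketch
    something to run.
    """
    def detector(log_mel: Spectrogram) -> float:
        values = [v for band in log_mel for v in band]
        return sigmoid(sum(values) / len(values) + bias)
    return detector


def identify_degradations(
    log_mel: Spectrogram,
    detectors: Dict[str, Detector],
    threshold: float = 0.5,  # assumed decision threshold
) -> Tuple[Dict[str, float], List[str]]:
    """Run every detector on the same input and list the ones that fire.

    Each detector decides independently, so several degradation types
    (or none) can be reported for a single signal.
    """
    scores = {name: detect(log_mel) for name, detect in detectors.items()}
    detected = [name for name, score in scores.items() if score >= threshold]
    return scores, detected


# Toy input: 2 mel bands x 2 frames of zeros.
toy_log_mel = [[0.0, 0.0], [0.0, 0.0]]
toy_detectors = {
    "additive_noise": make_stub_detector(2.0),
    "nonlinear_distortion": make_stub_detector(-2.0),
    "reverberation": make_stub_detector(1.0),
}
scores, detected = identify_degradations(toy_log_mel, toy_detectors)
print(detected)
```

Because the detectors are independent, this layout reports co-occurring degradations naturally, which a single mutually exclusive classifier would not.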

Updated: 2021-02-05