Robust emotional speech recognition based on binaural model and emotional auditory mask in noisy environments
EURASIP Journal on Audio, Speech, and Music Processing (IF 2.4) Pub Date: 2018-08-28, DOI: 10.1186/s13636-018-0133-9
Meysam Bashirpour, Masoud Geravanchizadeh

The performance of automatic speech recognition systems degrades in the presence of emotional states and in adverse environments (e.g., noisy conditions). This greatly limits the deployment of speech recognition applications in realistic environments. Previous studies in the field of emotion-affected speech recognition have focused on improving emotional speech recognition using clean speech data recorded in a quiet environment (i.e., controlled studio settings). The goal of this research is to increase the robustness of speech recognition systems for emotional speech in noisy conditions. The proposed binaural emotional speech recognition system is based on the analysis of the binaural input signal and an estimated emotional auditory mask corresponding to the recognized emotion. Whereas the binaural signal analyzer has the task of segregating speech from noise and constructing a speech mask in a noisy environment, the estimated emotional mask identifies and removes the most emotionally affected spectro-temporal regions of the segregated target speech. In other words, our proposed system combines the two estimated masks (the binary noise mask and the emotion-specific mask) to decrease the word error rate for noisy emotional speech. The performance of the proposed binaural system is evaluated in clean neutral training/noisy emotional testing scenarios for different noise types, signal-to-noise ratios, and spatial configurations of the sources. Speech utterances from the Persian emotional speech database are used for the experiments. Simulation results show that the proposed system achieves higher performance than baseline automatic speech recognition systems trained with neutral utterances.
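The abstract describes combining a binaural (speech-vs-noise) binary mask with an emotion-specific mask and applying the result to the spectro-temporal representation of the target speech. The exact combination rule is not given in the abstract; the sketch below is a minimal illustration assuming both masks are binary time-frequency matrices and are combined by an elementwise AND (product), with hypothetical function names (`combine_masks`, `apply_mask`) and toy dimensions chosen for the example only.

```python
import numpy as np


def combine_masks(binaural_mask: np.ndarray, emotion_mask: np.ndarray) -> np.ndarray:
    """Keep only time-frequency units that both masks mark as reliable.

    binaural_mask: binary mask (1 = speech-dominated unit) from the binaural analyzer.
    emotion_mask:  binary mask (1 = unit not strongly affected by emotion).
    Combination by elementwise product is an assumption for illustration.
    """
    return binaural_mask * emotion_mask


def apply_mask(spectrogram: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out the spectro-temporal regions rejected by the combined mask."""
    return spectrogram * mask


if __name__ == "__main__":
    # Toy example: 4 frequency channels x 5 time frames (hypothetical sizes).
    rng = np.random.default_rng(0)
    spec = rng.random((4, 5))
    binaural_mask = (rng.random((4, 5)) > 0.3).astype(float)  # speech vs. noise units
    emotion_mask = (rng.random((4, 5)) > 0.2).astype(float)   # emotion-robust units
    masked_spec = apply_mask(spec, combine_masks(binaural_mask, emotion_mask))
    print(masked_spec)
```

In this reading, a unit survives only if the binaural analyzer judges it speech-dominated and the emotional mask judges it not strongly distorted by the recognized emotion; the masked spectrogram would then feed the recognizer's feature extraction.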

Updated: 2018-08-28