An efficient algorithm for recognition of emotions from speaker and language independent speech using deep learning
Multimedia Tools and Applications ( IF 3.6 ) Pub Date : 2021-01-20 , DOI: 10.1007/s11042-020-10399-2
Youddha Beer Singh , Shivani Goel

Automatic emotion recognition from speech is a demanding and challenging problem, since the emotional states of humans are difficult to differentiate. With hand-crafted features, the main difficulty is extracting the important features from speech. Recognition accuracy can be increased with deep learning approaches, which use high-level features of the speech signal. In this work, a deep learning algorithm is proposed that extracts high-level features from raw data with high accuracy, irrespective of the language and speakers (male/female) of the speech corpora. For this, the .wav files are converted into RGB spectrograms (images) and normalized to size 224x224x3 for fine-tuning a Deep Convolutional Neural Network (DCNN) to recognize emotions. The DCNN model is trained in two stages: in stage 1, the optimal learning rate is identified using the Learning Rate (LR) range test, and in stage 2 the model is trained again with that optimal learning rate. A special stride is used to down-sample the features while reducing the model size. The emotions considered are happiness, sadness, anger, fear, disgust, boredom/surprise, and neutral. The proposed algorithm is tested on three popular public speech corpora: EMODB (German), EMOVO (Italian), and SAVEE (British English). The reported emotion recognition accuracy is better than that of existing studies across different languages and speakers.
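The preprocessing step described above — converting a .wav waveform into an RGB spectrogram image normalized to 224x224x3 for DCNN input — can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: the FFT parameters, the nearest-neighbor resize, and the simple three-channel colorization are assumptions standing in for whatever spectrogram library and colormap the paper actually used.

```python
# Sketch (illustrative parameters, not the authors' exact pipeline):
# turn a raw waveform into a 224x224x3 spectrogram "image" of the
# kind the abstract describes as DCNN input.
import numpy as np

def spectrogram(wave, n_fft=512, hop=128):
    """Log-magnitude spectrogram via a sliding-window FFT (Hann window)."""
    window = np.hanning(n_fft)
    frames = [wave[i:i + n_fft] * window
              for i in range(0, len(wave) - n_fft, hop)]
    mags = np.abs(np.fft.rfft(np.stack(frames), axis=1))
    return np.log1p(mags).T  # shape: (freq_bins, time_frames)

def to_rgb_224(spec):
    """Min-max normalize, nearest-neighbor resize to 224x224, and spread
    into 3 channels (a real pipeline would apply a proper colormap)."""
    spec = (spec - spec.min()) / (np.ptp(spec) + 1e-8)
    rows = np.linspace(0, spec.shape[0] - 1, 224).astype(int)
    cols = np.linspace(0, spec.shape[1] - 1, 224).astype(int)
    img = spec[np.ix_(rows, cols)]
    # Three nonlinear mappings of the same magnitudes stand in for RGB.
    return np.stack([img, img ** 0.5, img ** 2], axis=-1)

# Usage: one second of a 16 kHz test tone stands in for an utterance.
wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
image = to_rgb_224(spectrogram(wave))
print(image.shape)  # (224, 224, 3)
```

The fixed 224x224x3 shape matches the standard input size of ImageNet-pretrained CNNs, which is presumably why the paper normalizes spectrograms to it before fine-tuning.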




Updated: 2021-01-21