When Old Meets New: Emotion Recognition from Speech Signals
Cognitive Computation (IF 4.3), Pub Date: 2021-04-19, DOI: 10.1007/s12559-021-09865-2
Keith April Araño, Peter Gloor, Carlotta Orsenigo, Carlo Vercellis

Speech is one of the most natural communication channels for expressing human emotions. Speech emotion recognition (SER) has therefore been an active area of research, with a wide range of applications in domains such as biomedical diagnostics in healthcare and human–machine interaction. Recent work in SER has focused on end-to-end deep neural networks (DNNs). However, the scarcity of emotion-labeled speech datasets inhibits the full potential of training a deep network from scratch. In this paper, we propose new approaches for classifying emotions from speech by combining conventional mel-frequency cepstral coefficients (MFCCs) with image features extracted from spectrograms by a pretrained convolutional neural network (CNN). Unlike prior studies that employ end-to-end DNNs, our methods eliminate the resource-intensive network training process. Using the best prediction model obtained, we also build an SER application that predicts emotions in real time. Among the proposed methods, the hybrid feature set fed into a support vector machine (SVM) achieves an accuracy of 0.713 on a 6-class prediction problem evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset, which is higher than previously published results. Interestingly, MFCCs used as the sole input to a long short-term memory (LSTM) network achieve a slightly higher accuracy of 0.735. Our results show that the proposed approaches improve prediction accuracy, and the empirical findings demonstrate the effectiveness of a pretrained CNN as an automatic feature extractor for emotion prediction. Moreover, the success of the MFCC-LSTM model is evidence that, despite being conventional features, MFCCs can still outperform more sophisticated deep-learning feature sets.
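To make the hybrid pipeline concrete, here is a minimal sketch of how such a feature extractor could be assembled. It is not the authors' implementation: the use of librosa for MFCCs, VGG16 (via a recent torchvision) as the pretrained CNN, mean-pooling of MFCC frames over time, and the RBF-kernel SVM are all illustrative assumptions.

```python
# Hedged sketch of a hybrid MFCC + pretrained-CNN feature pipeline.
# Assumptions: librosa for MFCCs, torchvision's VGG16 as the frozen CNN,
# scikit-learn's SVM. None of these choices is confirmed by the paper.
import numpy as np
import librosa
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import SVC

def mfcc_features(path, n_mfcc=40):
    """MFCCs mean-pooled over time -> fixed-length vector of shape (n_mfcc,)."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Pretrained CNN used purely as a frozen feature extractor: no training.
cnn = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
cnn.classifier = cnn.classifier[:-1]  # drop the 1000-way class layer -> 4096-d output
cnn.eval()

preprocess = T.Compose([
    T.ToTensor(),
    T.Resize((224, 224), antialias=True),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def spectrogram_features(path):
    """Mel spectrogram rendered as a 3-channel image, passed through the CNN."""
    y, sr = librosa.load(path, sr=None)
    spec = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
    spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-8)  # scale to [0, 1]
    img = np.stack([spec] * 3, axis=-1).astype(np.float32)         # grayscale -> RGB
    with torch.no_grad():
        feats = cnn(preprocess(img).unsqueeze(0))
    return feats.squeeze(0).numpy()  # shape: (4096,)

def hybrid_vector(path):
    """Concatenate conventional MFCCs with CNN image features."""
    return np.concatenate([mfcc_features(path), spectrogram_features(path)])

# Training on a hypothetical emotion-labeled corpus (`paths`, `labels`):
# X = np.stack([hybrid_vector(p) for p in paths])
# clf = SVC(kernel="rbf").fit(X, labels)
```

Note that the CNN is frozen and run in eval mode, which reflects the point stressed above: the pretrained network serves purely as an automatic feature extractor, so no resource-intensive network training is involved.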
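The MFCC-LSTM variant could look roughly like the following sketch. The single LSTM layer, hidden size, and frame count are assumptions; only the MFCC input and the 6-class output follow the setup described above.

```python
# Hedged sketch of the MFCC -> LSTM classifier variant; layer sizes are assumed.
import torch
import torch.nn as nn

class MfccLstm(nn.Module):
    def __init__(self, n_mfcc=40, hidden=128, n_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):             # x: (batch, time, n_mfcc) MFCC frames
        _, (h, _) = self.lstm(x)      # h: (1, batch, hidden), final hidden state
        return self.head(h[-1])       # emotion-class logits: (batch, n_classes)

# Example: a batch of 8 utterances, 200 MFCC frames each -> logits of shape (8, 6).
logits = MfccLstm()(torch.randn(8, 200, 40))
```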



Updated: 2021-04-19