Attention guided 3D CNN-LSTM model for accurate speech based emotion recognition
Applied Acoustics ( IF 3.4 ) Pub Date : 2021-07-07 , DOI: 10.1016/j.apacoust.2021.108260
Orhan Atila 1 , Abdulkadir Şengür 1

In this paper, a novel approach based on an attention-guided 3D convolutional neural network (CNN)-long short-term memory (LSTM) model is proposed for speech-based emotion recognition. The proposed attention-guided 3D CNN-LSTM model is trained in an end-to-end fashion. The input speech signals are first resampled and pre-processed to remove noise and emphasize the high frequencies. Then, spectrogram, Mel-frequency cepstral coefficient (MFCC), cochleagram and fractal-dimension methods are used to convert the input speech signals into speech images. The obtained images are concatenated into four-dimensional volumes and used as input to the developed 28-layer attention-integrated 3D CNN-LSTM model. The 3D CNN-LSTM model contains six 3D convolutional layers, two batch normalization (BN) layers, five Rectified Linear Unit (ReLU) layers, three 3D max-pooling layers, one attention layer, one LSTM layer, one flatten layer, one dropout layer, and two fully connected layers. The attention layer is connected to the 3D convolutional layers. Three datasets, namely the Ryerson Audio-Visual Database of Emotional Speech (RAVDESS), RML and SAVEE, are used in the experimental work, and a mixture of these datasets is evaluated as well. Classification accuracy, sensitivity, specificity and F1-score are used to evaluate the developed method. The obtained results are compared with several recently published results, and the proposed method is seen to outperform the compared methods.
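The abstract does not give the exact preprocessing parameters, so the following is only a minimal sketch of the front-end steps it describes: a pre-emphasis filter to boost high frequencies, a magnitude spectrogram, and the stacking of several feature "images" into a four-dimensional input volume. The filter coefficient, FFT size, hop length, and the stand-in feature channels are all assumptions for illustration, not the paper's settings (the paper uses spectrogram, MFCC, cochleagram and fractal-dimension images).

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]: a first-order high-pass that
    # emphasizes high frequencies (alpha = 0.97 is a common default)
    return np.append(x[0], x[1:] - alpha * x[:-1])

def spectrogram(x, n_fft=256, hop=128):
    # magnitude STFT via a Hann-windowed sliding FFT
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (freq, time)

# toy 1-second signal at an assumed 16 kHz sample rate
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
speech = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(sr)

emphasized = pre_emphasis(speech)
spec = spectrogram(emphasized)

# The paper stacks four different feature images (spectrogram, MFCC,
# cochleagram, fractal dimension) of a common shape; here we fake four
# channels from the same spectrogram purely to show the 4D construction.
features = [spec, np.sqrt(spec), np.log1p(spec), spec / spec.max()]
volume = np.stack(features, axis=-1)   # (freq, time, channels)
volume = volume[np.newaxis, ...]       # add batch axis -> 4D volume
print(volume.shape)
```

With the assumed parameters (16000 samples, `n_fft=256`, `hop=128`) this yields a volume of shape `(1, 129, 124, 4)`: 129 frequency bins, 124 time frames, four feature channels.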
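The abstract states only that an attention layer is connected to the 3D convolutional layers, without specifying its form. As a generic illustration of the idea, the sketch below implements plain soft attention over a sequence of feature vectors: each step is scored against a learnable vector, the scores are softmax-normalized, and the steps are summed with those weights. The dimensions and the score vector are hypothetical, not taken from the paper.

```python
import numpy as np

def soft_attention(features, w):
    # features: (T, D) sequence of feature vectors (e.g. pooled conv maps)
    # w: (D,) learnable scoring vector
    scores = features @ w                   # (T,) unnormalized relevance
    weights = np.exp(scores - scores.max()) # numerically stable softmax
    weights /= weights.sum()
    context = weights @ features            # (D,) attention-weighted summary
    return context, weights

rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 8))  # e.g. 10 time steps of 8-dim features
w = rng.normal(size=8)
context, weights = soft_attention(feats, w)
print(context.shape, weights.sum())
```

The weights form a probability distribution over time steps, so the network can emphasize the emotionally salient frames before the summary is passed on to later layers.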




Updated: 2021-07-08