Attention and Feature Selection for Automatic Speech Emotion Recognition Using Utterance and Syllable-Level Prosodic Features, Circuits, Systems, and Signal Processing



Attention and Feature Selection for Automatic Speech Emotion Recognition Using Utterance and Syllable-Level Prosodic Features
Circuits, Systems, and Signal Processing (IF 2.3), Pub Date: 2020-05-14, DOI: 10.1007/s00034-020-01429-3
Starlet Ben Alex, Leena Mary, Ben P. Babu

This work attempts to recognize emotions from human speech using prosodic information, represented by variations in duration, energy, and fundamental frequency ($F_{0}$). The speech signal is first segmented automatically into syllables. Using the syllable boundaries, prosodic features are extracted at the utterance level (15 features) and the syllable level (10 features), and the two feature sets are modeled separately using deep neural network classifiers. The approach is evaluated on the German speech corpus EmotAsS (EMOTional Sensitivity ASsistance System for people with disabilities), the dataset used for the Interspeech 2018 Atypical Affect Sub-Challenge. On evaluation, the initial set of prosodic features yields an unweighted average recall (UAR) of 30.15%. Fusing the decision scores of these features with those of spectral features gives a UAR of 36.71%. To improve performance further, the paper also employs an attention mechanism and feature selection using resampling-based recursive feature elimination (RFE). Attention and feature selection followed by score-level fusion raise the UAR to 36.83% for the prosodic features and to 40.96% for the overall fusion. Fusing the scores of the best individual system of the Atypical Affect Sub-Challenge with those of the proposed system gives a UAR of 43.71%, above the best test result reported. The proposed system is also evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, where it reaches a UAR of 63.83%.
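
The two quantities the abstract reports throughout, UAR and score-level fusion, can be illustrated with a minimal Python sketch. UAR is the mean of the per-class recalls, so every emotion class counts equally regardless of how many utterances it has. The fusion rule below (a weighted average of per-class scores from two classifiers) is an assumption made for illustration; the abstract does not state which fusion rule or weights the authors use. The function names, the weight of 0.5, and the toy score matrices are all hypothetical.

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred, n_classes):
    """Mean of per-class recalls; each class counts equally regardless of size."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():  # skip classes that never occur in the reference labels
            recalls.append(np.mean(y_pred[mask] == c))
    return float(np.mean(recalls))

def fuse_scores(scores_a, scores_b, weight=0.5):
    """Assumed fusion rule: convex combination of two (n_samples, n_classes) score matrices."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    return weight * scores_a + (1.0 - weight) * scores_b

# Toy example (made-up numbers): 6 utterances, 3 classes, class 0 dominates.
y_true = np.array([0, 0, 0, 0, 1, 2])
prosodic_scores = np.array([
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
    [0.5, 0.3, 0.2],
])
spectral_scores = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.6, 0.3, 0.1],
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],
])

fused_scores = fuse_scores(prosodic_scores, spectral_scores, weight=0.5)

uar_prosodic = unweighted_average_recall(y_true, prosodic_scores.argmax(axis=1), 3)
uar_spectral = unweighted_average_recall(y_true, spectral_scores.argmax(axis=1), 3)
uar_fused = unweighted_average_recall(y_true, fused_scores.argmax(axis=1), 3)

print(f"UAR prosodic: {uar_prosodic:.4f}")
print(f"UAR spectral: {uar_spectral:.4f}")
print(f"UAR fused:    {uar_fused:.4f}")
```

In this toy data each classifier misses one minority class entirely, while the fused scores recover both, which is the kind of complementarity that score-level fusion relies on. Real systems would typically tune the fusion weight on a development set rather than fixing it.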


Updated: 2020-05-14
