Quality-Aware Bag of Modulation Spectrum Features for Robust Speech Emotion Recognition, IEEE Transactions on Affective Computing


IEEE Transactions on Affective Computing (IF 9.6), Pub Date: 2022-07-04, DOI: 10.1109/taffc.2022.3188223
Shruti Rajendra Kshirsagar, Tiago Henrik Falk


Automatic speech emotion recognition (SER) has gained popularity over the last decade and numerous Challenges have emerged. While the latest Challenges have shown that deep neural networks achieve the best results, existing input features are still a bottleneck and cause severe performance degradation in realistic “in-the-wild” scenarios. In this paper, we propose two innovations to tackle this issue. First, we propose to combine the bag-of-audio-words methodology with modulation spectrum features for environmental robustness. Second, we take advantage of the inherent quality-awareness properties of the modulation spectrum and propose feeding a quality feature to the speech emotion recognizer as an additional input. Experiments are conducted with three multilingual speech datasets used in recent SER Challenges, degraded by different noise sources and levels and by room reverberation. The results show that the proposed features (i) consistently outperform benchmark systems, (ii) provide information complementary to classical features, so that feature fusion improves performance, and (iii) are robust to environment and language mismatch. Moreover, we show that when the proposed system is provided with quality information, further improvements are obtained. Overall, the proposed bag of modulation spectrum features is shown to be a promising candidate for “in-the-wild” SER.
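As a rough illustration of the bag-of-audio-words idea mentioned in the abstract, the sketch below quantizes frame-level feature vectors against a codebook and summarizes an utterance as a normalized histogram of codeword counts. Everything specific here is an assumption made for illustration: the function name `bag_of_audio_words`, the toy 2-D frames, the three-entry codebook, and the scalar `quality_score` are hypothetical, and the abstract does not state how the authors build their codebook, compute the quality feature, or combine the two.

```python
from collections import Counter
from math import dist


def bag_of_audio_words(frames, codebook):
    """Return the normalized histogram of nearest-codeword assignments."""
    if not frames:
        raise ValueError("need at least one frame")
    # Assign every frame to the index of its closest codeword (Euclidean distance).
    counts = Counter(
        min(range(len(codebook)), key=lambda k: dist(frame, codebook[k]))
        for frame in frames
    )
    # One histogram bin per codeword, normalized by the number of frames.
    return [counts[k] / len(frames) for k in range(len(codebook))]


# Hypothetical toy data: in practice each frame would be a modulation
# spectrum feature vector and the codebook would be learned from training data.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (3.9, 0.2)]

histogram = bag_of_audio_words(frames, codebook)

# Assumption: a scalar quality estimate is simply appended to the histogram.
quality_score = 0.7
utterance_vector = histogram + [quality_score]

print(histogram)         # [0.25, 0.5, 0.25]
print(utterance_vector)  # [0.25, 0.5, 0.25, 0.7]
```

The resulting fixed-length vector is what a downstream classifier would consume, regardless of how many frames the utterance contains.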


Updated: 2022-07-04
