Two-level discriminative speech emotion recognition model with wave field dynamics: A personalized speech emotion recognition method
Computer Communications (IF 4.5), Pub Date: 2021-09-22, DOI: 10.1016/j.comcom.2021.09.013
Ning Jia, Chunjun Zheng

Presently available speech emotion recognition (SER) methods generally rely on a single SER model. Achieving higher SER accuracy depends on both the feature extraction method and the model design scheme. However, the generalization performance of such models is typically poor because the emotional features of different speakers can vary substantially. The present work addresses this issue by applying a two-level discriminative model to the SER task. The first level places an individual speaker within a specific speaker group according to the speaker's characteristics. The second level constructs a personalized SER model for each group of speakers using the wave field dynamics model and a dual-channel general SER model. The two levels of the discriminative model are fused to implement an ensemble learning scheme that achieves effective SER classification. The proposed method is demonstrated to provide higher SER accuracy in experiments based on the interactive emotional dyadic motion capture (IEMOCAP) corpus and a custom-built SER corpus. On the IEMOCAP corpus, the proposed model improves the recognition accuracy by 7%. On the custom-built SER corpus, both masked and unmasked speakers are employed to demonstrate that the proposed method maintains higher SER accuracy.
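The abstract describes the overall architecture but not the internals of the wave field dynamics or dual-channel models. The following is only a minimal sketch of the two-level ensemble idea, assuming (hypothetically) that speaker grouping is done by k-means over speaker-level feature vectors and that the per-group "personalized" models and the shared "general" model are ordinary classifiers; all names, data, and fusion weights here are illustrative, not the authors' implementation.

```python
# Sketch of a two-level SER pipeline: level 1 assigns a speaker to a group,
# level 2 fuses a group-specific model with a general model (weighted ensemble).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_groups, n_emotions, dim = 3, 4, 16

# Toy training data: utterance features X, emotion labels y, speaker-level features S.
X = rng.normal(size=(300, dim))
y = rng.integers(0, n_emotions, size=300)
S = rng.normal(size=(300, dim))

# Level 1: place each speaker into a speaker group based on speaker characteristics.
grouper = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit(S)
groups = grouper.labels_

# Level 2: one personalized model per speaker group plus one shared general model.
personalized = {
    g: LogisticRegression(max_iter=1000).fit(X[groups == g], y[groups == g])
    for g in range(n_groups)
}
general = LogisticRegression(max_iter=1000).fit(X, y)

def predict(x_utt, s_spk, alpha=0.5):
    """Fuse personalized and general predictions (simple weighted ensemble)."""
    g = int(grouper.predict(s_spk.reshape(1, -1))[0])
    p_personal = personalized[g].predict_proba(x_utt.reshape(1, -1))
    p_general = general.predict_proba(x_utt.reshape(1, -1))
    return int(np.argmax(alpha * p_personal + (1 - alpha) * p_general, axis=1)[0])

print(predict(X[0], S[0]))
```

In this sketch the fusion weight `alpha` stands in for whatever ensemble scheme the paper actually uses; the point is only that the group assignment from level 1 selects which personalized model contributes at level 2.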




Updated: 2021-09-30