Unsupervised Personalization of an Emotion Recognition System: The Unique Properties of the Externalization of Valence in Speech, IEEE Transactions on Affective Computing


IEEE Transactions on Affective Computing (IF 11.2), Pub Date: 2022-06-30, DOI: 10.1109/taffc.2022.3187336
Kusha Sridhar, Carlos Busso

The prediction of valence from speech is an important but challenging problem. The expression of valence in speech relies on speaker-dependent cues, which contributes to performance that is often significantly lower than for other emotional attributes such as arousal and dominance. A practical approach to improving valence prediction from speech is to adapt the models to the target speakers in the test set. Adapting a speech emotion recognition (SER) system to a particular speaker is a hard problem, especially with deep neural networks (DNNs), since it requires optimizing millions of parameters. This study proposes an unsupervised approach that addresses this problem by searching the training set for speakers whose acoustic patterns are similar to those of the speaker in the test set. Speech samples from the selected speakers are used to create the adaptation set. This approach leverages transfer learning: pre-trained models are adapted with these speech samples. We propose three alternative adaptation strategies: unique speaker, oversampling, and weighting approaches. These methods differ in how they use the adaptation set to personalize the valence models. The results demonstrate that a valence prediction model can be efficiently personalized with these unsupervised approaches, leading to relative improvements as high as 13.52%.
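The abstract does not say how similar speakers are found or how the three strategies treat the adaptation set, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes that each speaker is summarized by the mean of their acoustic feature vectors, that similarity is cosine similarity, and that the strategies differ in how they handle a training speaker selected by more than one test speaker: "unique" uses the samples once, "oversampling" repeats them once per selection, and "weighting" keeps one copy but scales a per-sample loss weight. The function names, the `top_k` parameter, and the strategy semantics are all hypothetical. No emotion labels from the test speakers are used, which is what keeps the selection unsupervised.

```python
from collections import Counter

import numpy as np


def centroids(feats, spk):
    """Mean acoustic feature vector per speaker (assumed speaker summary)."""
    return {str(s): feats[spk == s].mean(axis=0) for s in np.unique(spk)}


def closest_train_speakers(train_c, test_c, top_k=1):
    """For each test speaker, pick the top_k most similar train speakers.

    Cosine similarity between centroids is an assumption of this sketch.
    """
    picks = {}
    for t, tc in test_c.items():
        sims = {
            s: float(c @ tc / (np.linalg.norm(c) * np.linalg.norm(tc)))
            for s, c in train_c.items()
        }
        picks[t] = sorted(sims, key=sims.get, reverse=True)[:top_k]
    return picks


def adaptation_set(train_spk, picks, strategy):
    """Return (sample_indices, sample_weights) into the training set.

    The meaning given to each strategy here is hypothetical.
    """
    # How many test speakers selected each train speaker.
    counts = Counter(s for chosen in picks.values() for s in chosen)
    idx, weights = [], []
    for s, n in counts.items():
        rows = np.flatnonzero(train_spk == s)
        if strategy == "unique":
            idx.append(rows)
            weights.append(np.ones(len(rows)))
        elif strategy == "oversampling":
            idx.append(np.tile(rows, n))
            weights.append(np.ones(len(rows) * n))
        elif strategy == "weighting":
            idx.append(rows)
            weights.append(np.full(len(rows), float(n)))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return np.concatenate(idx), np.concatenate(weights)
```

The returned indices and weights would then drive fine-tuning of a pre-trained valence model, for example as the rows of the adaptation batches and as per-sample loss weights. That fine-tuning step is omitted here because the abstract gives no detail about the model or the training procedure.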


Updated: 2022-06-30
