Speech Emotion Recognition using Semantic Information
arXiv - CS - Sound Pub Date: 2021-03-04, DOI: arxiv-2103.02993
Panagiotis Tzirakis, Anh Nguyen, Stefanos Zafeiriou, Björn W. Schuller

Speech emotion recognition is a crucial problem manifesting in a multitude of applications such as human-computer interaction and education. Although several advancements have been made in recent years, especially with the advent of Deep Neural Networks (DNNs), most studies in the literature fail to consider the semantic information in the speech signal. In this paper, we propose a novel framework that can capture both the semantic and the paralinguistic information in the signal. In particular, our framework comprises a semantic feature extractor, which captures the semantic information, and a paralinguistic feature extractor, which captures the paralinguistic information. The semantic and paralinguistic features are then combined into a unified representation using a novel attention mechanism. The unified feature vector is passed through an LSTM to capture the temporal dynamics in the signal before the final prediction. To validate the effectiveness of our framework, we use the popular SEWA dataset of the AVEC challenge series and compare with the three winning papers. Our model provides state-of-the-art results in the valence and liking dimensions.
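
The following is a minimal sketch of the fusion idea described in the abstract, not the authors' exact architecture: two frame-aligned feature streams (semantic and paralinguistic) are projected into a shared space, weighted per frame by an attention-style gate, and the fused sequence is fed to an LSTM before a regression head over continuous emotion dimensions. The feature dimensions, layer sizes, and the specific attention form are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AttentionFusionSER(nn.Module):
    """Sketch: attention-based fusion of semantic and paralinguistic features,
    followed by an LSTM for temporal modelling (dimensions are assumptions)."""

    def __init__(self, sem_dim=768, para_dim=88, fused_dim=256, num_outputs=3):
        super().__init__()
        # Project both feature streams into a common space.
        self.sem_proj = nn.Linear(sem_dim, fused_dim)
        self.para_proj = nn.Linear(para_dim, fused_dim)
        # Scalar score per frame decides how much each stream contributes.
        self.attn = nn.Linear(fused_dim, 1)
        # LSTM captures the temporal dynamics of the fused representation.
        self.lstm = nn.LSTM(fused_dim, 128, batch_first=True)
        # Frame-level predictions for continuous dimensions (e.g. arousal, valence, liking).
        self.head = nn.Linear(128, num_outputs)

    def forward(self, sem_feats, para_feats):
        # sem_feats:  (batch, time, sem_dim)  e.g. frame-aligned text/ASR embeddings
        # para_feats: (batch, time, para_dim) e.g. acoustic/paralinguistic descriptors
        sem = torch.tanh(self.sem_proj(sem_feats))
        para = torch.tanh(self.para_proj(para_feats))
        # Softmax over the two streams gives per-frame fusion weights.
        scores = torch.cat([self.attn(sem), self.attn(para)], dim=-1)  # (B, T, 2)
        weights = torch.softmax(scores, dim=-1)
        fused = weights[..., 0:1] * sem + weights[..., 1:2] * para     # (B, T, fused_dim)
        out, _ = self.lstm(fused)
        return self.head(out)                                          # (B, T, num_outputs)


if __name__ == "__main__":
    model = AttentionFusionSER()
    sem = torch.randn(2, 100, 768)   # dummy semantic features
    para = torch.randn(2, 100, 88)   # dummy paralinguistic features
    print(model(sem, para).shape)    # torch.Size([2, 100, 3])
```

This scalar-gate fusion is only one plausible instantiation of the attention mechanism the paper refers to; the actual weighting scheme and feature extractors are described in the full paper.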

Updated: 2021-03-05