End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network, EURASIP Journal on Audio, Speech, and Music Processing



EURASIP Journal on Audio, Speech, and Music Processing (IF 1.7), Pub Date: 2021-05-12, DOI: 10.1186/s13636-021-00208-5
Duowei Tang¹, Peter Kuppens², Luc Geurts¹,³, Toon van Waterschoot¹


Amongst the various characteristics of a speech signal, the expression of emotion is one of those that exhibit the slowest temporal dynamics. Hence, a performant speech emotion recognition (SER) system requires a predictive model that is capable of learning sufficiently long temporal dependencies in the analysed speech signal. In this work, we therefore propose a novel end-to-end neural network architecture based on dilated causal convolution with context stacking. Firstly, the proposed model consists only of parallelisable layers and is hence suitable for parallel processing, avoiding the inherent lack of parallelisability of recurrent neural network (RNN) layers. Secondly, a dedicated dilated causal convolution block allows the model to have a receptive field as large as the input sequence length while maintaining a reasonably low computational cost. Thirdly, by introducing a context stacking structure, the proposed model is capable of exploiting long-term temporal dependencies, hence providing an alternative to the use of RNN layers. We evaluate the proposed model in SER regression and classification tasks and compare it with a state-of-the-art end-to-end SER model. Experimental results indicate that the proposed model requires only 1/3 of the number of model parameters used in the state-of-the-art model, while also significantly improving SER performance. Further experiments examine the impact of different input representations (i.e. raw audio samples vs log mel-spectrograms) and illustrate the benefits of an end-to-end approach over hand-crafted audio features. Moreover, we show that the proposed model can efficiently learn intermediate embeddings that preserve speech emotion information.
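To make the receptive-field claim concrete, here is a minimal sketch of a generic dilated causal convolution stack in plain Python. It is not the authors' architecture: the kernel size of 2, the doubling dilation schedule, the all-ones weights and the helper names `receptive_field` and `dilated_causal_conv1d` are all assumptions made for illustration, and context stacking is not modelled. It only shows that the receptive field of such a stack grows exponentially with depth while the number of layers grows linearly.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def dilated_causal_conv1d(x, weights, dilation):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation], treating x[t < 0] as zero."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # causal: only current and past samples contribute
                acc += w * x[idx]
        out.append(acc)
    return out


# Assumed configuration (not taken from the paper): kernel size 2,
# dilation doubling in every layer: 1, 2, 4, ..., 512.
dilations = [2 ** i for i in range(10)]
rf = receptive_field(2, dilations)

# Push a unit impulse through the stack and measure how far its influence reaches.
signal = [1.0] + [0.0] * 2047
for d in dilations:
    signal = dilated_causal_conv1d(signal, [1.0, 1.0], d)
reach = max(i for i, v in enumerate(signal) if v != 0.0) + 1

print(rf, reach)
```

Under these assumptions, ten layers with two weights each cover 1024 input samples, which is the sense in which a dilated stack can span a long input sequence at low computational cost.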


Updated: 2021-05-13
