Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition,arXiv - CS - Sound

arXiv - CS - Sound. Pub Date: 2019-07-13, DOI: arxiv-1907.06078
Siddique Latif, Rajib Rana, Sara Khalifa, Raja Jurdak, Julien Epps, and Björn W. Schuller

Despite the emerging importance of Speech Emotion Recognition (SER), state-of-the-art accuracy is quite low and needs improvement to make commercial applications of SER viable. A key underlying reason for the low accuracy is the scarcity of emotion datasets, which is a challenge for developing any robust machine learning model. In this paper, we propose a solution to this problem: a multi-task learning framework that uses auxiliary tasks for which data is abundantly available. We show that using this additional data can improve the primary task of SER, for which only limited labelled data is available. In particular, we use gender identification and speaker recognition as auxiliary tasks, which allow the use of very large datasets, e.g., speaker classification datasets. To maximise the benefit of multi-task learning, we further use an adversarial autoencoder (AAE) within our framework, which has a strong capability to learn powerful and discriminative features. Furthermore, the unsupervised AAE in combination with the supervised classification networks enables semi-supervised learning, which incorporates a discriminative component in the AAE unsupervised training pipeline. This semi-supervised learning helps to improve the generalisation of our framework and thus leads to improvements in SER performance. The proposed model is rigorously evaluated for categorical and dimensional emotion, and in cross-corpus scenarios. Experimental results demonstrate that the proposed model achieves state-of-the-art performance on two publicly available datasets.
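
The abstract does not state the training objective, so the following is only a minimal sketch of how the pieces it names could be combined for a single utterance: an autoencoder reconstruction term, an adversarial term on the latent code, and one classification term per task (emotion, gender, speaker), where a task with no label contributes nothing. The function names, the equal task weights, the mean-squared-error reconstruction term, and the standard GAN-style adversarial term are all assumptions made for illustration, not the authors' formulation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Negative log-probability of the correct class.
    return -math.log(softmax(logits)[label])

def mse(x, x_hat):
    # Mean squared reconstruction error (assumed reconstruction loss).
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def multitask_loss(x, x_hat, d_on_code, heads, labels, weights):
    """Combine unsupervised and supervised terms for one utterance.

    x, x_hat:  input features and their reconstruction
    d_on_code: discriminator output in (0, 1) for the encoder's latent code
    heads:     task name -> classifier logits computed from the latent code
    labels:    task name -> integer label, or None when the label is missing
    weights:   task name -> hypothetical weight for that task's loss
    """
    recon = mse(x, x_hat)
    # Encoder-side adversarial term (assumed form): reward codes that the
    # discriminator scores as coming from the prior.
    adv = -math.log(d_on_code)
    supervised = 0.0
    for task, logits in heads.items():
        label = labels.get(task)
        if label is None:
            continue  # unlabelled for this task: only the other terms apply
        supervised += weights[task] * cross_entropy(logits, label)
    return recon + adv + supervised

# Toy values, chosen only to exercise the function.
x, x_hat = [0.2, 0.4, 0.6], [0.1, 0.5, 0.6]
heads = {
    "emotion": [0.0, 0.0, 0.0, 0.0],  # 4 emotion classes, uniform logits
    "gender": [1.0, -1.0],
    "speaker": [0.5, 0.1, -0.3],
}
weights = {"emotion": 1.0, "gender": 1.0, "speaker": 1.0}

# Utterance from an emotion corpus: all three labels available.
full = multitask_loss(x, x_hat, 0.5, heads,
                      {"emotion": 2, "gender": 0, "speaker": 1}, weights)
# Utterance from a speaker corpus: no emotion label.
aux_only = multitask_loss(x, x_hat, 0.5, heads,
                          {"emotion": None, "gender": 0, "speaker": 1}, weights)
```

In this sketch an utterance from a large speaker corpus still updates the shared representation through the reconstruction, adversarial, gender, and speaker terms even though it carries no emotion label, which is one plausible reading of how abundant auxiliary data could help the data-scarce SER task.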

Updated: 2020-03-24
