Deep emotion recognition based on audio–visual correlation, IET Computer Vision


Deep emotion recognition based on audio–visual correlation
IET Computer Vision (IF 1.5), Pub Date: 2020-11-16, DOI: 10.1049/iet-cvi.2020.0013
Noushin Hajarolasvadi, Hasan Demirel


Human emotion recognition has been studied by means of unimodal channels over the last decade. However, efforts continue to answer tempting questions about how different modalities can complement each other. This study proposes a multimodal approach using three-dimensional (3D) convolutional neural networks (CNNs) to model human emotion through a modality-referenced system while investigating such questions. The proposed modality-referenced system selects the input data based on one of the modalities, which is regarded as the reference or master. The other modality, referred to as the slave, simply adjusts or attunes itself to the master in the temporal domain. In this context, the authors developed three multimodal emotion recognition systems, namely a video-referenced system, an audio-referenced system, and an audio–visual-referenced system, to explore the congruence impact of the audio and video modalities on each other. Two pipelines of 3D CNN architectures are employed, where k-means clustering is used in the master pipeline and the slave pipeline adapts itself in a temporal sense. The outputs of the two pipelines are fused to improve recognition performance. In addition, canonical correlation analysis and t-distributed stochastic neighbour embedding are used to validate the experiments. Results show that temporal alignment of the data between the two modalities improves the recognition performance significantly.
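The master/slave idea in the abstract can be illustrated with a small sketch. The abstract does not say how k-means is applied in the master pipeline or how the slave is aligned, so everything below is an assumption made for illustration: k-means is used here to pick one representative frame per cluster from the master stream, and the slave stream is cut into fixed-length windows centred on the same time stamps. The function names (`kmeans`, `select_keyframes`, `align_slave`) and all parameters are hypothetical and are not taken from the paper.

```python
import math


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; centroids start at evenly spaced points."""
    n = len(points)
    centroids = [list(points[(i * n) // k]) for i in range(k)]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so centroids are stable
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def select_keyframes(points, k):
    """Assumed master step: per cluster, the frame closest to its centroid."""
    labels, centroids = kmeans(points, k)
    keyframes = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            keyframes.append(
                min(members, key=lambda i: math.dist(points[i], centroids[c]))
            )
    return sorted(keyframes)


def align_slave(keyframes, master_rate, slave_rate, slave_len, window):
    """Assumed slave step: map master indices to (start, end) slave windows.

    Expects window <= slave_len; windows are clipped to stay in range.
    """
    spans = []
    for idx in keyframes:
        centre = round(idx / master_rate * slave_rate)
        start = max(0, min(centre - window // 2, slave_len - window))
        spans.append((start, start + window))
    return spans


# Toy data: six one-dimensional "video frame" features at 2 frames per second,
# paired with a 300-sample "audio" stream at 100 samples per second.
frames = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
keyframes = select_keyframes(frames, k=2)
spans = align_slave(keyframes, master_rate=2, slave_rate=100,
                    slave_len=300, window=50)
print(keyframes, spans)
```

In this toy setting video acts as the master and audio as the slave; swapping the roles of the two streams would give the audio-referenced variant under the same assumptions.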


Updated: 2020-11-17
