A multiple feature fusion framework for video emotion recognition in the wild
Concurrency and Computation: Practice and Experience (IF 1.5) | Pub Date: 2020-04-07 | DOI: 10.1002/cpe.5764
Najmeh Samadiani, Guangyan Huang, Wei Luo, Chi-Hung Chi, Yanfeng Shu, Rui Wang, Tuba Kocaturk

Human emotions can be recognized from facial expressions captured in videos. Video emotion recognition is a growing research area, with many attempts to improve detection in both lab-controlled and unconstrained environments. While existing methods achieve decent recognition accuracy on lab-controlled datasets, they deliver much lower accuracy in real-world uncontrolled environments, where a variety of challenges must be addressed, such as variations in illumination, head pose, and individual appearance. Moreover, automatically identifying the key frames that contain the expression in real-world videos is another challenge. In this article, to overcome these challenges, we propose a video emotion recognition method based on multiple feature fusion. First, uniform local binary pattern (LBP) and scale-invariant feature transform (SIFT) features are extracted from each frame of the video sequences. By applying a random forest classifier, every static frame is then labelled with the related emotion class. In this way, the key frames, covering neutral and other expressions, can be identified automatically. Furthermore, a new geometric feature vector and the LBP from three orthogonal planes are extracted from the key frames. To further improve robustness, audio features are extracted from the video sequences as an additional modality that augments the visual facial expression analysis. The audio and visual features are fused through a kernel multimodal sparse representation. Finally, the corresponding emotion labels are assigned to the video sequences, with a multimodal quality measure specifying the quality of each modality and its role in the decision. The results on both the Acted Facial Expressions in the Wild and MMI datasets demonstrate that the proposed method outperforms several counterpart video emotion recognition methods.
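As a concrete illustration of the first stage described above, the sketch below extracts a uniform LBP histogram and pooled SIFT descriptors from each frame and uses a random forest to label frames by emotion, so that non-neutral key frames can be picked out. This is not the authors' implementation: the library choices (scikit-image, OpenCV, scikit-learn), the mean-pooling of SIFT descriptors, and all parameter values are assumptions made for illustration only.

```python
# Minimal sketch of per-frame feature extraction and key-frame labelling.
# Assumptions: frames are uint8 grayscale face crops; emotion labels are strings
# such as "neutral", "happy", ...; parameters (P, R, n_estimators) are illustrative.
import cv2
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.ensemble import RandomForestClassifier

P, R = 8, 1  # LBP neighbourhood: 8 sampling points at radius 1 (assumed values)


def frame_features(gray_face: np.ndarray) -> np.ndarray:
    """Concatenate a uniform-LBP histogram with mean-pooled SIFT descriptors."""
    # Non-rotation-invariant uniform LBP: P*(P-1)+3 = 59 pattern codes for P=8
    lbp = local_binary_pattern(gray_face, P, R, method="nri_uniform")
    lbp_hist, _ = np.histogram(lbp.ravel(), bins=np.arange(60), density=True)

    # SIFT keypoint descriptors, mean-pooled into a single 128-D vector
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(gray_face, None)
    sift_vec = desc.mean(axis=0) if desc is not None else np.zeros(128)

    return np.concatenate([lbp_hist, sift_vec])


def find_key_frames(frames, clf: RandomForestClassifier):
    """Label every frame with the classifier; non-neutral frames are key frames."""
    feats = np.stack([frame_features(f) for f in frames])
    labels = clf.predict(feats)
    return [i for i, lab in enumerate(labels) if lab != "neutral"]


# Training sketch: X_train holds per-frame feature vectors, y_train the emotion labels.
# clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
# key_idx = find_key_frames(video_frames, clf)
```

In this sketch the geometric features, LBP from three orthogonal planes, audio features, and the kernel multimodal sparse-representation fusion from the later stages are omitted; it only shows how static frames could be labelled so that key frames are selected automatically.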

Updated: 2020-04-07