Learning Facial Action Units with Spatiotemporal Cues and Multi-label Sampling.
Image and Vision Computing (IF 4.2), Pub Date: 2018-10-28, DOI: 10.1016/j.imavis.2018.10.002
Wen-Sheng Chu, Fernando De la Torre, Jeffrey F. Cohn

Facial action units (AUs) can be represented spatially, temporally, and in terms of their correlation. Previous research has focused on one or another of these aspects, or has addressed them disjointly. We propose a hybrid network architecture that jointly models spatial and temporal representations and their correlation. In particular, we use a Convolutional Neural Network (CNN) to learn spatial representations and a Long Short-Term Memory (LSTM) network to model temporal dependencies among them. The outputs of the CNN and LSTM are aggregated by a fusion network to produce per-frame predictions of multiple AUs. The hybrid network was compared to previous state-of-the-art approaches on two large FACS-coded video databases, GFT and BP4D, with over 400,000 AU-coded frames of spontaneous facial behavior in varied social contexts. Relative to a standard multi-label CNN and feature-based state-of-the-art approaches, the hybrid system reduced person-specific biases and achieved higher accuracy for AU detection. To address class imbalance within and between batches during network training, we introduce multi-label sampling strategies that further increase accuracy when AUs are relatively sparse. Finally, we provide visualizations of the learned AU models, which, to the best of our knowledge, reveal for the first time how machines see AUs.
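The hybrid design the abstract describes (per-frame CNN features, an LSTM over those features, and a fusion head producing multi-label outputs) can be illustrated with a short sketch. Below is a minimal PyTorch rendering under our own assumptions: the class name HybridAUNet, the tiny convolutional backbone, and all layer sizes are placeholders, not the authors' actual configuration.

import torch
import torch.nn as nn

class HybridAUNet(nn.Module):
    """CNN learns per-frame spatial features; an LSTM models temporal
    dependencies across frames; a fusion head combines both streams
    into per-frame multi-label AU predictions."""
    def __init__(self, num_aus=12, feat_dim=256, hidden_dim=128):
        super().__init__()
        # Spatial stream: a small CNN standing in for the paper's backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Temporal stream: LSTM over the sequence of per-frame CNN features.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Fusion head: concatenate spatial and temporal features per frame.
        self.fusion = nn.Linear(feat_dim + hidden_dim, num_aus)

    def forward(self, frames):  # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)   # (B, T, F)
        temporal, _ = self.lstm(feats)                          # (B, T, H)
        logits = self.fusion(torch.cat([feats, temporal], -1))  # (B, T, A)
        return logits

Because several AUs can be active in the same frame, training would pair the per-frame logits with a sigmoid activation and an independent binary loss per AU rather than a softmax over classes.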
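The multi-label sampling idea (rebalancing batches so that sparse AUs still appear often enough during training) also admits a simple sketch. The inverse-frequency weighting and the helper make_multilabel_sampler below are our illustrative choices, not necessarily the paper's exact scheme.

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def make_multilabel_sampler(labels, num_samples=None):
    """labels: (N, num_aus) binary array of per-frame AU annotations."""
    labels = np.asarray(labels, dtype=np.float64)
    au_freq = labels.mean(axis=0).clip(min=1e-6)  # occurrence rate per AU
    inv_freq = 1.0 / au_freq                      # rare AUs get large weights
    # Each frame's weight is driven by the rarest AU it contains; frames
    # with no active AU fall back to the smallest (most common AU's) weight.
    frame_w = (labels * inv_freq).max(axis=1)
    frame_w[frame_w == 0] = inv_freq.min()
    return WeightedRandomSampler(
        weights=torch.as_tensor(frame_w),
        num_samples=num_samples or len(frame_w),
        replacement=True,
    )

Passing the returned sampler to a DataLoader biases each mini-batch toward frames containing rare AUs, which is the within- and between-batch balancing effect the abstract targets.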


