Attended End-to-end Architecture for Age Estimation from Facial Expression Videos.
IEEE Transactions on Image Processing (IF 10.8). Pub Date: 2019-10-24. DOI: 10.1109/tip.2019.2948288
Wenjie Pei , Hamdi Dibeklioglu , Tadas Baltrusaitis , David M J Tax

The main challenges of age estimation from facial expression videos lie not only in modeling the static facial appearance, but also in capturing the temporal facial dynamics. Traditional approaches to this problem focus on constructing handcrafted features to explore the discriminative information contained in facial appearance and dynamics separately, which relies on sophisticated feature refinement and framework design. In this paper, we present an end-to-end architecture for age estimation, called the Spatially-Indexed Attention Model (SIAM), which simultaneously learns both the appearance and the dynamics of age from raw videos of facial expressions. Specifically, we employ convolutional neural networks to extract effective latent appearance representations and feed them into recurrent networks to model the temporal dynamics. More importantly, we propose to leverage attention models for salience detection, both in the spatial domain for each single image and in the temporal domain for the whole video. We design a specific spatially-indexed attention mechanism among the convolutional layers to extract the salient facial regions in each individual image, and a temporal attention layer to assign attention weights to each frame. This two-pronged approach not only improves performance by allowing the model to focus on informative frames and facial areas, but also offers an interpretable correspondence between the spatial facial regions and temporal frames on the one hand, and the task of age estimation on the other. We demonstrate the strong performance of our model in experiments on a large, gender-balanced database of 400 subjects with ages spanning from 8 to 76 years. Experiments reveal that our model exhibits significant superiority over state-of-the-art methods given sufficient training data.
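The two attention stages described above (spatial weights over locations within each frame, then temporal weights over frames) can be sketched as follows. This is a minimal NumPy illustration of the general attention-pooling idea, not the paper's actual SIAM implementation: the feature maps and weight vectors (`feats`, `w_spatial`, `w_temporal`, `w_out`) are random placeholders standing in for learned CNN features and parameters, and the recurrent layer between the two stages is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy dimensions: T frames, an HxW spatial grid, C channels per location.
T, H, W, C = 5, 4, 4, 8
feats = rng.standard_normal((T, H, W, C))  # stand-in for CNN feature maps

# Spatial attention: one score per spatial location in each frame,
# normalised with a softmax so each frame's weights sum to 1.
w_spatial = rng.standard_normal(C)               # hypothetical learned weights
flat = feats.reshape(T, H * W, C)
alpha = softmax(flat @ w_spatial, axis=1)        # (T, H*W) spatial weights
frame_repr = (alpha[..., None] * flat).sum(axis=1)  # (T, C) per-frame vectors

# (In the paper a recurrent network would transform frame_repr here.)

# Temporal attention: one weight per frame, normalised over the video.
w_temporal = rng.standard_normal(C)              # hypothetical learned weights
beta = softmax(frame_repr @ w_temporal)          # (T,) frame weights
video_repr = beta @ frame_repr                   # (C,) video representation

# A final linear regressor maps the pooled representation to an age.
w_out, b_out = rng.standard_normal(C), 30.0      # placeholder regressor
age = float(video_repr @ w_out + b_out)
print(f"predicted age (toy): {age:.2f}")
```

Because both `alpha` and `beta` are explicit probability distributions, they can be inspected directly, which is what gives the model the interpretable correspondence between facial regions, frames, and the age prediction.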

Updated: 2020-04-22