Listen to Your Face: Inferring Facial Action Units from Audio Channel
IEEE Transactions on Affective Computing (IF 11.2), Pub Date: 2019-10-01, DOI: 10.1109/taffc.2017.2749299
Zibo Meng, Shizhong Han, Yan Tong

Extensive efforts have been devoted to recognizing facial action units (AUs). However, it is still challenging to recognize AUs from spontaneous facial displays, especially when they are accompanied by speech. Unlike all prior work, which utilized visual observations for facial AU recognition, this paper presents a novel approach that recognizes speech-related AUs exclusively from audio signals, based on the fact that facial activities are highly correlated with voice during speech. Specifically, the dynamic and physiological relationships between AUs and phonemes are modeled through a continuous-time Bayesian network (CTBN); AU recognition is then performed by probabilistic inference via the CTBN model. A pilot audiovisual AU-coded database has been constructed to evaluate the proposed audio-based AU recognition framework. The database consists of a “clean” subset with frontal and neutral faces and a challenging subset collected with large head movements and occlusions. Experimental results on this database show that the proposed CTBN model achieves promising recognition performance for 7 speech-related AUs and outperforms both state-of-the-art visual-based and audio-based methods, especially for AUs that are activated at low intensities or are “hardly visible” in the visual channel. The improvement is even more pronounced on the challenging subset, where the visual-based approaches suffer significantly.
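The abstract does not spell out the CTBN's parameterization, so the sketch below only illustrates the general mechanism a CTBN relies on: each node evolves as a continuous-time Markov chain whose transition intensity matrix is conditioned on the current states of its parents, and inference propagates a state distribution through intervals where the parents stay constant. Here a single binary AU node is conditioned on an observed phoneme stream; the phoneme labels, intensity values, segmentation, and the `filter_au` helper are all hypothetical placeholders, not parameters from the paper.

```python
# Minimal CTBN-style filtering sketch (illustrative, not the authors' model).
# State 0 = AU inactive, state 1 = AU active.
import numpy as np
from scipy.linalg import expm

# Conditional intensity matrices Q[phoneme]; each row sums to zero.
# Rates are made-up values chosen only to make the example concrete.
Q = {
    "AA":  np.array([[-2.0,  2.0],    # open vowel: AU activates quickly
                     [ 0.5, -0.5]]),
    "SIL": np.array([[-0.1,  0.1],    # silence: AU rarely activates
                     [ 3.0, -3.0]]),
}

def filter_au(segments, p0):
    """Propagate P(AU state) through phoneme segments of (label, duration)."""
    p = p0.copy()
    for label, dur in segments:
        # Exact CTMC transient solution over an interval with a constant parent:
        # p(t + dur) = p(t) @ expm(Q[label] * dur)
        p = p @ expm(Q[label] * dur)
    return p

# Observed phoneme segmentation of a short utterance (durations in seconds).
segments = [("SIL", 0.30), ("AA", 0.25), ("SIL", 0.40)]
p0 = np.array([1.0, 0.0])             # AU starts inactive
print(filter_au(segments, p0))        # distribution over AU states at the end
```

The design choice this mirrors is the one the abstract names: because the parent (the phoneme) is piecewise constant over the segmentation, the child's posterior can be propagated segment by segment with a matrix exponential, which is what makes probabilistic inference over asynchronous AU and phoneme dynamics tractable in continuous time.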

Updated: 2019-10-01