Dense 3D face alignment from 2D video for real-time use*
Introduction
Face alignment is the problem of automatically locating detailed facial landmarks across different subjects, illuminations, and viewpoints. Previous methods can be divided into two broad categories. 2D-based methods locate a relatively small number of 2D fiducial points in real time, while 3D-based methods fit a high-resolution 3D model offline at a much higher computational cost and usually require manual initialization. 2D-based approaches include Active Appearance Models [1], [2], Constrained Local Models [3], [4], and shape-regression-based methods [5], [6], [7], [8], [9]. These approaches train a set of 2D models, each of which is intended to cope with shape or appearance variation within a small range of viewpoints. In contrast, 3D-based methods [10], [11], [12], [13] accommodate a wide range of views using a single 3D model. Recent 2D approaches enable person-independent initialization, which is not possible with 3D approaches. 3D approaches have an advantage with respect to representational power and robustness to illumination and pose, but are not feasible for generic fitting and real-time use.
Seminal work by Blanz and Vetter [10] on 3D morphable models minimized the intensity difference between synthesized and source-video images. Dimitrijevic et al. [11] proposed a 3D morphable model similar to that of Blanz and Vetter but discarded the texture component to reduce sensitivity to illumination. Zhang et al. [12] proposed an approach that deforms a 3D mesh model so that the 3D corner points reconstructed from a stereo pair lie on the surface of the model. Both [11], [12] minimize shape differences instead of intensity differences, but rely on stereo correspondence. Single-view face reconstruction methods [14], [15] produce a detailed 3D representation, but do not estimate the deformations over time. Recently, Suwajanakorn et al. [16] proposed a 3D-flow-based approach coupled with shape from shading to reconstruct a time-varying detailed 3D shape of a person's face from video. Gu and Kanade [13] developed an approach for aligning a 3D deformable model to a single face image. Their model consists of a sparse set of 3D points and the view-based patches associated with each point. These and other 3D-based methods require precise initialization, which typically involves manual labeling of fiducial landmark points. The gain with 3D-based approaches is their far greater representational power, which is robust to illumination and viewpoint variation that would scuttle 2D-based approaches.
A key advantage of 2D-based approaches is their much lower computational cost and, more recently, the ability to forgo manual initialization. In the last few years in particular, 2D face alignment has reached a mature state with the emergence of discriminative shape-regression methods [5], [6], [7], [8], [9], [17], [18], [19], [20], [21], [22], [23], [24]. These techniques predict a face shape in a cascaded manner: they begin with an initial guess about shape and then progressively refine that guess by regressing a shape increment, step by step, from a feature space. The feature space can be either hand-designed, such as SIFT features [7], or learned from data [6], [8], [9].
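The cascaded update described above — an initial shape progressively refined by regressing increments from shape-indexed features — can be sketched as follows. This is a minimal illustration only: the toy data, the simulated feature extractor, and the ridge-regularized linear stage regressors are stand-ins, not the descriptors or regressors used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: N faces, each with L landmarks in 2D.
N, L, T = 200, 5, 4
true_shapes = rng.normal(size=(N, L, 2))
# "Images": simulated appearance evidence, here noisy landmark observations.
observations = true_shapes + 0.1 * rng.normal(size=(N, L, 2))

def extract_features(obs, shape):
    """Stand-in for shape-indexed features (e.g. SIFT patches or learned
    binary descriptors sampled around the current landmark estimates)."""
    return np.concatenate([obs.ravel(), shape.ravel()])

def train_stage(feats, resid, ridge=1e-3):
    """Fit one linear stage regressor R_t mapping features to shape increments."""
    return np.linalg.solve(feats.T @ feats + ridge * np.eye(feats.shape[1]),
                           feats.T @ resid)

shapes = np.zeros_like(true_shapes)          # initial guess: the mean shape
regressors = []
for t in range(T):                           # cascade of T stages
    feats = np.stack([extract_features(o, s)
                      for o, s in zip(observations, shapes)])
    resid = (true_shapes - shapes).reshape(N, -1)
    R = train_stage(feats, resid)
    regressors.append(R)
    shapes = shapes + (feats @ R).reshape(N, L, 2)   # progressive refinement

err0 = np.abs(true_shapes).mean()                    # error of the initial guess
errT = np.abs(true_shapes - shapes).mean()           # error after the cascade
```

At test time the same `T` learned regressors are applied in sequence, which is what makes these methods fast enough for real-time use.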
Most previous work has emphasized 2D face tracking and registration. Relatively neglected is the application of cascade regression to dense 3D face alignment. Only recently did Cao et al. [24] propose a method for regressing facial landmarks from 2D video; pose and facial expression are recovered by fitting a user-specific blendshape model to the landmarks. This method was then extended to a person-independent case [25], where the estimated 2D landmarks were used to adapt the camera matrix and user identity to better match facial expression. Because this approach uses both 2D and 3D annotations, a correction step is needed to resolve inconsistency in the landmark positions across different poses and self-occlusions.
Our approach exploits 3D cascade regression, where the facial landmarks are consistent across all poses. To avoid inconsistency in landmark positions encountered by Cao et al., the face is annotated completely in 3D by selecting a dense set of 3D points (shape). Binary feature descriptors (appearance) associated with a sparse subset of the landmarks are used to regress projections of 3D points. The method first estimates the location of a dense set of landmarks and their visibility, then reconstructs face shapes by fitting a part-based 3D model. The method was made possible in part by training on the BU-4DFE [26] and BP-4D-Spontaneous [27] datasets that contain over 300,000 high-resolution 3D face scans. Because the algorithm makes no assumptions about illumination or surface properties, it can be applied to a wide range of imaging conditions. The method was validated in a series of tests. We found that 3D registration from 2D video effectively handles previously unseen faces with a variety of poses and illuminations. See Fig. 1 for an overview of the system.
This paper makes two main contributions:
Dense cascade-regression-based face alignment
Previous work on cascade-regression-based face alignment was limited to a small number of fiducial landmarks. We achieve a dense alignment with a manageable model size. We show that this is achievable by using a relatively small number of sparse measurements and a compressed representation of landmark displacement-updates. Furthermore, the facial landmarks are always consistent across pose, eliminating the discrepancies between 2D and 3D annotations that have plagued previous approaches.
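One way to read "compressed representation of landmark displacement-updates" is that the cascade regresses only the coefficients of a low-dimensional basis of dense displacements, rather than the full 2L-dimensional update vector. The PCA basis, dimensions, and simulated displacement data below are hypothetical and serve only to illustrate how the model size shrinks; they are not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(1)

L, K = 1000, 30        # dense landmark count vs. compressed basis size
N = 500                # number of training displacement vectors

# Simulated training displacements: dense 2L-dim update vectors that in
# practice lie near a low-dimensional subspace, because faces deform
# coherently rather than landmark-by-landmark.
basis_true = rng.normal(size=(2 * L, 10))
updates = rng.normal(size=(N, 10)) @ basis_true.T

# PCA compression of the displacement updates: keep the top K components.
mean_update = updates.mean(axis=0)
U, S, Vt = np.linalg.svd(updates - mean_update, full_matrices=False)
B = Vt[:K].T                                  # (2L, K) compression basis

# A stage regressor now only has to predict K coefficients; the dense
# update is recovered as mean + B @ c, shrinking the regressor output
# (and hence the model size) by a factor of roughly 2L / K.
coeffs = (updates - mean_update) @ B          # (N, K) compressed updates
recon = mean_update + coeffs @ B.T            # dense updates recovered
err = np.abs(recon - updates).max()
```

Because the simulated displacements have rank well below K, the reconstruction here is essentially exact; on real data K trades model size against reconstruction accuracy.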
Real-time 3D part-based deformable model fitting
By using dense cascade regression, we fit a 3D, part-based deformable model to the landmarks. The algorithm iteratively refines the 3D shape and the 3D pose until convergence. We utilize measurements over multiple frames to refine the rigid 3D shape.
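The iterative pose/shape refinement can be sketched as fitting a linear 3D deformable model to 2D landmarks by alternating least squares. The affine (weak-perspective) camera, the random deformation basis, and the alternating scheme below are simplifying assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

L, K = 40, 5                                   # landmarks, deformation modes
mean_shape = rng.normal(size=(3, L))           # mean 3D shape
basis = rng.normal(size=(K, 3, L))             # deformation (part) basis

def project(M, t, q):
    """Affine (weak-perspective) projection of the deformed 3D shape."""
    X = mean_shape + np.tensordot(q, basis, axes=1)   # (3, L) deformed shape
    return M @ X + t[:, None]                         # (2, L) image points

# Ground-truth pose and shape coefficients to recover.
M_true = rng.normal(size=(2, 3))
t_true = rng.normal(size=2)
q_true = rng.normal(size=K)
landmarks = project(M_true, t_true, q_true)           # observed 2D landmarks

# Alternate until convergence:
#   (i)  pose [M | t] by linear least squares, holding q fixed;
#   (ii) shape coefficients q by linear least squares, holding the pose fixed.
q = np.zeros(K)
for _ in range(100):
    X = mean_shape + np.tensordot(q, basis, axes=1)
    A = np.vstack([X, np.ones((1, L))])               # (4, L) homogeneous shape
    Mt = landmarks @ np.linalg.pinv(A)                # (2, 4) best affine pose
    M, t = Mt[:, :3], Mt[:, 3]
    # The residual is linear in q: landmarks - (M @ mean + t) = sum_k q_k M @ B_k
    J = np.stack([(M @ basis[k]).ravel() for k in range(K)], axis=1)
    r = (landmarks - M @ mean_shape - t[:, None]).ravel()
    q, *_ = np.linalg.lstsq(J, r, rcond=None)

err = np.abs(project(M, t, q) - landmarks).max()      # final reprojection error
```

Each half-step solves a linear least-squares problem exactly, so the reprojection error is non-increasing; extending the data term over multiple frames (with per-frame poses and a shared rigid shape) follows the same pattern.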
The paper is organized as follows: Section 2 details the dense 3D model building process and Section 3 describes the model fitting method in detail. The efficiency of our solution is illustrated by numerical experiments in Section 4. Conclusions are drawn in Section 5.
Vectors (a) and matrices (A) are denoted by bold letters. The Euclidean norm of a vector u ∈ ℝᵈ is written ‖u‖. [A₁; …; Aₖ] denotes the concatenation of the matrices A₁, …, Aₖ.
Dense face model building
In this section we detail the components of the dense 3D face model building process.
Model fitting
In this section we describe the dense cascade regression and the 3D model fitting process.
Experiments
We conducted a battery of experiments to evaluate the precision of 3D reconstruction and extensions to multi-view reconstruction. The studies concern (i) feature spaces, (ii) optimal model density, (iii) the number of measurements in single-view and (iv) multi-view scenarios, (v) temporal integration, and (vi) the performance of 3D head pose estimation under various illumination conditions.
Discussion and conclusions
Faces move, yet most approaches to face alignment use a 2D representation that effectively assumes the face is planar. For frontal or nearly frontal views, this fiction works reasonably well. As rotation from frontal view increases, however, approaches that assume a 2D representation begin to fail. Action unit detection becomes less accurate after about 15 to 20 degrees rotation from frontal [52] and expression transfer to a near-photo-realistic avatar fails when the source face rotates beyond…
Acknowledgments
Preparation of this publication was supported in part by the National Institute of Mental Health of the National Institutes of Health under Award Number MH096951, Army Research Laboratory Collaborative Technology Alliance Program under cooperative agreement W911NF-10-2-0016, and the Carnegie Mellon University People Image Analysis Consortium. The content is solely the responsibility of the authors and does not necessarily represent the official views of the sponsors.
References (55)
- et al., Automatic feature localisation with constrained local models, Pattern Recogn. (2008)
- et al., Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment
- et al., BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database, Image Vis. Comput. (2014)
- et al., Active appearance models, IEEE Trans. Pattern Anal. Mach. Intell. (2001)
- et al., Active appearance models revisited, Int. J. Comput. Vis. (2004)
- et al., Deformable model fitting by regularized landmark mean-shift, Int. J. Comput. Vis. (2011)
- et al., Cascaded pose regression
- et al., Face alignment by explicit shape regression
- et al., Supervised descent method and its applications to face alignment
- et al., Robust face landmark estimation under occlusion
- Face alignment at 3000 FPS via regressing local binary features
- A morphable model for the synthesis of 3D faces
- Accurate face models from uncalibrated and ill-lit video sequences
- Robust and rapid generation of animated faces from video images: a model-based modeling approach, Int. J. Comput. Vis.
- 3D alignment of face in a single image
- 3D face reconstruction from a single image using a single reference face shape, IEEE Trans. Pattern Anal. Mach. Intell.
- Viewing real-world faces in 3D
- Total moving face reconstruction
- Real-time facial feature detection using conditional regression forests
- Deep convolutional network cascade for facial point detection
- Facial point detection using boosted regression and graph models
- Local evidence aggregation for regression-based facial point detection, IEEE Trans. Pattern Anal. Mach. Intell.
- One millisecond face alignment with an ensemble of regression trees
- Incremental face alignment in the wild
- 3D shape regression for real-time facial animation, ACM Trans. Graph.
- Displaced dynamic expression regression for real-time facial tracking and animation, ACM Trans. Graph.
- A high-resolution 3D dynamic facial expression database
* This paper has been recommended for acceptance by Vitomir Štruc.