Elsevier

Image and Vision Computing

Volume 58, February 2017, Pages 13-24

Dense 3D face alignment from 2D video for real-time use*

https://doi.org/10.1016/j.imavis.2016.05.009

Highlights

  • A 3D cascade regression approach is proposed in which facial landmarks remain invariant across pose.

  • From a single 2D image of a person's face, a dense 3D shape is registered in real time for each frame.

  • Multi-view reconstruction and temporal integration for videos are presented.

  • Method is robust for 3D head-pose estimation under various conditions.

Abstract

To enable real-time, person-independent 3D registration from 2D video, we developed a 3D cascade regression approach in which facial landmarks remain invariant across pose over a range of approximately 60°. From a single 2D image of a person's face, a dense 3D shape is registered in real time for each frame. The algorithm utilizes a fast cascade regression framework trained on high-resolution 3D face-scans of posed and spontaneous emotion expression. The algorithm first estimates the location of a dense set of landmarks and their visibility, then reconstructs face shapes by fitting a part-based 3D model. Because no assumptions are required about illumination or surface properties, the method can be applied to a wide range of imaging conditions that include 2D video and uncalibrated multi-view video. The method has been validated in a battery of experiments that evaluate its precision of 3D reconstruction, extension to multi-view reconstruction, temporal integration for videos and 3D head-pose estimation. Experimental findings strongly support the validity of real-time, 3D registration and reconstruction from 2D video. The software is available online at http://zface.org.

Introduction

Face alignment is the problem of automatically locating detailed facial landmarks across different subjects, illuminations, and viewpoints. Previous methods can be divided into two broad categories. 2D-based methods locate a relatively small number of 2D fiducial points in real time, while 3D-based methods fit a high-resolution 3D model offline at a much higher computational cost and usually require manual initialization. 2D-based approaches include Active Appearance Models [1], [2], Constrained Local Models [3], [4] and shape-regression-based methods [5], [6], [7], [8], [9]. These approaches train a set of 2D models, each of which is intended to cope with shape or appearance variation within a small range of viewpoints. In contrast, 3D-based methods [10], [11], [12], [13] accommodate a wide range of views using a single 3D model. Recent 2D approaches enable person-independent initialization, which is not possible with 3D approaches. 3D approaches have an advantage in representational power and in robustness to illumination and pose, but are not feasible for generic fitting and real-time use.

Seminal work by Blanz and Vetter [10] on 3D morphable models minimized the intensity difference between synthesized and source-video images. Dimitrijevic et al. [11] proposed a 3D morphable model similar to Blanz and Vetter's that discards the texture component in order to reduce sensitivity to illumination. Zhang et al. [12] proposed an approach that deforms a 3D mesh model so that the 3D corner points reconstructed from a stereo pair lie on the surface of the model. Both [11] and [12] minimize shape differences instead of intensity differences, but rely on stereo correspondence. Single-view face reconstruction methods [14], [15] produce a detailed 3D representation, but do not estimate deformations over time. Recently, Suwajanakorn et al. [16] proposed a 3D-flow-based approach coupled with shape from shading to reconstruct a time-varying, detailed 3D shape of a person's face from video. Gu and Kanade [13] developed an approach for aligning a 3D deformable model to a single face image; the model consists of a set of sparse 3D points and the view-based patches associated with each point. These and other 3D-based methods require precise initialization, which typically involves manual labeling of fiducial landmark points. The gain with 3D-based approaches is their far greater representational power, which is robust to the illumination and viewpoint variation that would scuttle 2D-based approaches.

A key advantage of 2D-based approaches is their much lower computational cost and more recently the ability to forgo manual initialization. In the last few years in particular, 2D face alignment has reached a mature state with the emergence of discriminative shape regression methods [5], [6], [7], [8], [9], [17], [18], [19], [20], [21], [22], [23], [24]. These techniques predict a face shape in a cascade manner: They begin with an initial guess about shape and then progressively refine that guess by regressing a shape increment step-by-step from a feature space. The feature space can be either hand designed, such as SIFT features [7], or learned from data [6], [8], [9].
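The cascade update these methods share can be sketched in a few lines. The sketch below is illustrative only: the `extract_features` stub and the zero-initialized linear regressors are hypothetical placeholders, assuming shapes are stored as flattened landmark-coordinate vectors.

```python
import numpy as np

def extract_features(image, shape):
    """Stand-in for a real feature extractor (e.g. SIFT patches or
    learned binary features sampled around the current landmarks)."""
    rng = np.random.default_rng(abs(int(shape.sum() * 1000)) % 2**32)
    return rng.standard_normal(128)

def cascade_align(image, initial_shape, regressors):
    """Cascaded shape regression: start from an initial guess and refine
    it stage by stage with a learned shape increment."""
    shape = initial_shape.copy()
    for R in regressors:               # each R maps features to an increment
        phi = extract_features(image, shape)
        shape = shape + R @ phi        # linear update in this sketch
    return shape

# Toy usage: 68 landmarks as a flattened (x, y) vector, 3 cascade stages.
num_pts = 68
init = np.zeros(num_pts * 2)
stages = [np.zeros((num_pts * 2, 128)) for _ in range(3)]
final = cascade_align(None, init, stages)
print(final.shape)
```

In practice the regressors are trained stage by stage, each on the residual left by the previous stage, which is what gives the cascade its coarse-to-fine behavior.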

Most previous work has emphasized 2D face tracking and registration. Relatively neglected is the application of cascade regression to dense 3D face alignment. Only recently did Cao et al. [24] propose a method that regresses facial landmarks from 2D video and recovers pose and facial expression by fitting a user-specific blendshape model to those landmarks. This method was then extended to the person-independent case [25], in which the estimated 2D landmarks are used to adapt the camera matrix and user identity to better match facial expression. Because this approach uses both 2D and 3D annotations, a correction step is needed to resolve inconsistencies in landmark positions across different poses and self-occlusions.

Our approach exploits 3D cascade regression, where the facial landmarks are consistent across all poses. To avoid inconsistency in landmark positions encountered by Cao et al., the face is annotated completely in 3D by selecting a dense set of 3D points (shape). Binary feature descriptors (appearance) associated with a sparse subset of the landmarks are used to regress projections of 3D points. The method first estimates the location of a dense set of landmarks and their visibility, then reconstructs face shapes by fitting a part-based 3D model. The method was made possible in part by training on the BU-4DFE [26] and BP-4D-Spontaneous [27] datasets that contain over 300,000 high-resolution 3D face scans. Because the algorithm makes no assumptions about illumination or surface properties, it can be applied to a wide range of imaging conditions. The method was validated in a series of tests. We found that 3D registration from 2D video effectively handles previously unseen faces with a variety of poses and illuminations. See Fig. 1 for an overview of the system.

This paper makes two main contributions:

  • Dense cascade-regression-based face alignment

    Previous work on cascade-regression-based face alignment was limited to a small number of fiducial landmarks. We achieve a dense alignment with a manageable model size. We show that this is achievable by using a relatively small number of sparse measurements and a compressed representation of landmark displacement-updates. Furthermore, the facial landmarks are always consistent across pose, eliminating the discrepancies between 2D and 3D annotations that have plagued previous approaches.
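One way to keep a dense regressor's model size manageable, in the spirit of the compressed representation of displacement-updates described above, is to regress updates in a low-dimensional principal subspace. The point count, sample count, and basis size below are illustrative assumptions, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 256 dense 3D points -> 768-dim update vectors.
num_points, num_samples, k = 256, 500, 40
dim = 3 * num_points

# Training displacement-updates (target shape minus current estimate).
updates = rng.standard_normal((num_samples, dim))

# Principal subspace of the updates: regress k coefficients, not 768 values.
_, _, Vt = np.linalg.svd(updates, full_matrices=False)
basis = Vt[:k].T                      # (dim, k) compressed update basis

dense_update = updates[0]
coeffs = basis.T @ dense_update       # compress: (k,) coefficients
approx = basis @ coeffs               # decompress to all landmarks: (dim,)
print(coeffs.shape, approx.shape)
```

The regressor then only has to predict `k` coefficients per cascade stage, so its size grows with the subspace dimension rather than with the number of dense landmarks.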

  • Real-time 3D part-based deformable model fitting

    By using dense cascade regression, we fit a 3D, part-based deformable model to the landmarks. The algorithm iteratively refines the 3D shape and the 3D pose until convergence. We utilize measurements over multiple frames to refine the rigid 3D shape.
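The alternating refinement of 3D pose and 3D shape can be sketched as below. This is a simplified scaled-orthographic fit (Procrustes pose step, least-squares shape step), not the paper's exact solver; all sizes in the toy usage are assumptions.

```python
import numpy as np

def fit_pose_shape(pts2d, mean_shape, basis, n_iters=10):
    """Alternately estimate a scaled-orthographic pose and non-rigid
    shape coefficients. A sketch of part-based model fitting."""
    K = basis.shape[0]
    coeffs = np.zeros(K)
    for _ in range(n_iters):
        shape3d = mean_shape + np.tensordot(coeffs, basis, axes=1)  # (n, 3)
        # Pose step: scale, 2x3 row-orthonormal projection R, translation t.
        mu2, mu3 = pts2d.mean(0), shape3d.mean(0)
        A = (pts2d - mu2).T @ (shape3d - mu3)          # (2, 3) cross-covariance
        U, s, Vt = np.linalg.svd(A)
        R = U @ np.eye(2, 3) @ Vt                      # orthographic projection
        proj = (shape3d - mu3) @ R.T
        scale = s.sum() / (proj ** 2).sum()
        t = mu2 - scale * mu3 @ R.T
        # Shape step: least-squares fit of the deformation coefficients.
        P = scale * R
        M = np.stack([(basis[k] @ P.T).ravel() for k in range(K)], axis=1)
        r = (pts2d - mean_shape @ P.T - t).ravel()
        coeffs, *_ = np.linalg.lstsq(M, r, rcond=None)
    return scale, R, t, coeffs

# Toy usage: synthesize an exactly orthographic frontal view and recover it.
rng = np.random.default_rng(1)
n, K = 50, 5
X = rng.standard_normal((n, 3))
X -= X.mean(0)
_, s_, Vt_ = np.linalg.svd(X, full_matrices=False)
mean_shape = X @ Vt_.T                 # principal-axis coordinates
basis = 0.1 * rng.standard_normal((K, n, 3))
pts2d = 0.8 * mean_shape[:, :2] + np.array([1.0, -2.0])
scale, R, t, coeffs = fit_pose_shape(pts2d, mean_shape, basis)
reproj = scale * (mean_shape + np.tensordot(coeffs, basis, axes=1)) @ R.T + t
print(np.abs(reproj - pts2d).max() < 1e-6)
```

Measurements from multiple frames can be folded in by stacking their residuals into the shape step, which is the idea behind refining the rigid shape over time.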

The paper is organized as follows: Section 2 details the dense 3D model building process and Section 3 describes the model fitting method in detail. The efficiency of our novel solution method is illustrated by numerical experiments in Section 4. Conclusions are drawn in Section 5.

Vectors ($\mathbf{a}$) and matrices ($\mathbf{A}$) are denoted by bold letters. The Euclidean norm of a vector $\mathbf{u} \in \mathbb{R}^d$ is $\|\mathbf{u}\|_2 = \sqrt{\sum_{i=1}^{d} u_i^2}$. $\mathbf{B} = [\mathbf{A}_1; \ldots; \mathbf{A}_K] \in \mathbb{R}^{(d_1 + \cdots + d_K) \times N}$ denotes the row-wise concatenation of matrices $\mathbf{A}_k \in \mathbb{R}^{d_k \times N}$.
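A quick numerical illustration of this notation, as a minimal NumPy sketch:

```python
import numpy as np

# Euclidean norm of u in R^d: ||u||_2 = sqrt(sum_i u_i^2)
u = np.array([3.0, 4.0])
print(np.linalg.norm(u))              # 5.0

# B = [A1; A2]: row-wise concatenation of matrices with N columns each
A1 = np.ones((2, 4))
A2 = np.zeros((3, 4))
B = np.concatenate([A1, A2], axis=0)  # shape (2 + 3, 4)
print(B.shape)
```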

Section snippets

Dense face model building

In this section we detail the components of the dense 3D face model building process.

Model fitting

In this section we describe the dense cascade regression and the 3D model fitting process.

Experiments

We conducted a battery of experiments to evaluate the precision of 3D reconstruction and extensions to multi-view reconstruction. The studies concern (i) feature spaces, (ii) optimal model density, (iii) the number of measurements in single-view and (iv) multi-view scenarios, (v) temporal integration, and (vi) the performance of 3D head-pose estimation under various illumination conditions.

Discussion and conclusions

Faces move, yet most approaches to face alignment use a 2D representation that effectively assumes the face is planar. For frontal or nearly frontal views, this fiction works reasonably well. As rotation from frontal view increases, however, approaches that assume a 2D representation begin to fail. Action unit detection becomes less accurate after about 15 to 20 degrees rotation from frontal [52] and expression transfer to a near-photo-realistic avatar fails when the source face rotates beyond …

Acknowledgments

Preparation of this publication was supported in part by the National Institute of Mental Health of the National Institutes of Health under Award Number MH096951, Army Research Laboratory Collaborative Technology Alliance Program under cooperative agreement W911NF-10-2-0016, and the Carnegie Mellon University People Image Analysis Consortium. The content is solely the responsibility of the authors and does not necessarily represent the official views of the sponsors.

References (55)

  • S. Ren et al., Face alignment at 3000 FPS via regressing local binary features.

  • V. Blanz et al., A morphable model for the synthesis of 3D faces.

  • M. Dimitrijevic et al., Accurate face models from uncalibrated and ill-lit video sequences.

  • Z. Zhang et al., Robust and rapid generation of animated faces from video images: a model-based modeling approach, Int. J. Comput. Vis. (2004).

  • L. Gu et al., 3D alignment of face in a single image.

  • I. Kemelmacher-Shlizerman et al., 3D face reconstruction from a single image using a single reference face shape, IEEE Trans. Pattern Anal. Mach. Intell. (2011).

  • T. Hassner, Viewing real-world faces in 3D.

  • S. Suwajanakorn et al., Total moving face reconstruction.

  • M. Dantone et al., Real-time facial feature detection using conditional regression forests.

  • Y. Sun et al., Deep convolutional network cascade for facial point detection.

  • M. Valstar et al., Facial point detection using boosted regression and graph models.

  • B. Martinez et al., Local evidence aggregation for regression-based facial point detection, IEEE Trans. Pattern Anal. Mach. Intell. (2013).

  • V. Kazemi et al., One millisecond face alignment with an ensemble of regression trees.

  • A. Asthana et al., Incremental face alignment in the wild.

  • C. Cao et al., 3D shape regression for real-time facial animation, ACM Trans. Graph. (2013).

  • C. Cao et al., Displaced dynamic expression regression for real-time facial tracking and animation, ACM Trans. Graph. (2014).

  • L. Yin et al., A high-resolution 3D dynamic facial expression database.
    *

    This paper has been recommended for acceptance by Vitomir Štruc.
