Abstract

Accurately reconstructing a 3D model of the human face is a challenging issue in computer vision. Due to the complexity of face reconstruction and the diversity of facial features, most existing methods aim to reconstruct a smooth face model while ignoring facial details. In this paper, a novel deep learning-based face reconstruction method is proposed. It contains two modules: initial face reconstruction and face detail synthesis. In the initial face reconstruction module, a neural network is used to detect facial feature points and the angle of the posed face, and a 3D Morphable Model (3DMM) is used to reconstruct the rough shape of the face. In the face detail synthesis module, a Conditional Generative Adversarial Network (CGAN) is used to synthesize a displacement map. The map provides texture features that are rendered onto the reconstructed face surface to reflect facial details. Our proposal is evaluated on the Facescape dataset and achieves better performance than other current methods.

1. Introduction

The face is one of the most important biological characteristics of human beings, and face modeling is often used in security, animation, biometrics, and other fields [1, 2]. In recent years, due to the limitations of 2D images, research on the human face has gradually shifted from 2D planar images to 3D spatial models.

The pipeline of 3D face reconstruction is very complex if the model is reconstructed step by step, and such a staged reconstruction leads to more data loss and lower accuracy. To address this issue, a one-step reconstruction model is presented (see Figure 1). The reconstruction system is divided into two parts, the initial face reconstruction module and the face detail synthesis module, and both are based on deep learning [3]. The initial face reconstruction module is mainly responsible for face alignment. A supervised learning method is used to train on 60K face images from the 300W-LP dataset to obtain the corresponding dictionary. In this process, a CNN is used to align nonfrontal faces and detect their feature points. The feature points are input into the principal component analysis- (PCA-) based 3DMM [4] to obtain a rough face shape. The face detail synthesis module is based on CGAN, which takes the original image as input to synthesize a displacement map; the displacement map retains the more complete details of the face [5]. The face detail synthesis module refers to DFDN to train on high-quality images and obtain the training data from which the displacement map can be synthesized.

In this paper, we propose a reconstruction system to recover the details of the face model. Our reconstruction system can better solve the problem of face pose reconstruction and facial expression reconstruction from the input image. The facial detail synthesis module of the reconstruction system can extract facial features from the input image and synthesize the displacement map containing most of the details of the target face. Compared with the initial shape model, the detail face model with displacement map has better visual effect and more accurate data.

The rest of the paper is organized as follows. Section 2 reviews related work on 3D face reconstruction. Section 3 describes the initial face reconstruction module. Section 4 describes the face detail synthesis module. Section 5 presents the experiments and analysis. Section 6 concludes the paper.

2. Related Work

With the application of deep learning methods to graphics, the transition from 2D planar images to 3D spatial models has become one of the popular research directions. Blanz et al. proposed the concept of the 3DMM, and a Basel Face Model (BFM) was obtained by training on objects and related data collected with a depth camera [6]. The parameterized BFM has the universal characteristics of a human face, and a deformed 3D model can be obtained by inputting shape, texture, and attribute parameters. A large number of 3DMM-based algorithms have since been proposed. Tran et al. proposed a nonlinear 3D face deformation model method [7], which uses a large number of unconstrained pictures as training objects to train a new 3DMM architecture without using 3D scanning equipment. Galteri et al. used CGAN to refine 3DMM [8].

In addition to the traditional 3DMM, end-to-end methods based on deep learning can also reconstruct 3D face models well. An end-to-end method performs face alignment on the input face image and, in the vector space, maps the detected feature points one by one to the face model of a dense point cloud. This approach is simple and fast, and compared with the traditional 3DMM, its accuracy is higher in most cases. Yao et al. designed PRNet [9] based on a CNN structure and a deep residual network and used the UV vector space to complete the mapping to the 3D face model. Jackson et al. proposed VRN [10], a combination of 3DMM and CNN, to reconstruct models from nonfrontal face images. Tran et al. used an end-to-end neural network to reconstruct facial details under extreme conditions [11].

The rendering of the face model is also a key part. Ranjan et al. proposed the COMA method to generate the head mesh and used the MPI-IS Mesh Processing Library, an efficient 3D model rendering tool, for rendering [12]. Li et al. designed the FLAME model to render the basic shape and expression of the face model [13]. Sanyal et al. proposed RingNet based on FLAME [14]; RingNet can reconstruct the head model from an input face image and can better simulate facial expressions. A deep 3D face reconstruction method was proposed by Deng et al. [15]. This method is based on 3DMM and coarse facial expression [16], and the rendered model is more accurate.

3. Initial Face Reconstruction Module

The initial face reconstruction module is the key module in the proposed reconstruction system. This module maps the input face image directly to a rough initial face model and includes posed face alignment, feature point detection, and model fitting.

3.1. Construction of Rough Face Model
3.1.1. Face Alignment

In our method, the feature point coordinates are used as the input of 3DMM based on the PCA algorithm to construct a parameterized model.

Because manual labeling is time-consuming and labor-intensive, and traditional feature point detection has poor robustness and accuracy, we use a CNN to handle the face alignment of nonfrontal face images. This article uses the DLIB library to detect feature points. The DLIB library uses a cascade of regression tree ensembles [17] to generate a feature point model through supervised learning on image sets with feature point annotations. Given an input image, the algorithm generates an initial shape based on the target face and roughly estimates the locations of the feature points. Then, a gradient boosting algorithm is used to reduce the error between the initial shape and the real landmarks, and the least squares method is used to minimize the error to obtain the regressor of each cascade stage:

$\hat{S}^{(t+1)} = \hat{S}^{(t)} + r_t\big(I, \hat{S}^{(t)}\big) \quad (1)$

where $t$ is the number of cascade regressions, $\hat{S}^{(t)}$ is the shape vector of the $t$-th cascade regression, and $I$ is the input image. The key point of the cascade is that the regressor $r_t$ makes its prediction according to image pixel intensity values indexed relative to the current shape vector $\hat{S}^{(t)}$. The feature points of a nonfrontal face image are divided into two parts, visible and invisible; since the latter are difficult to predict, deep learning methods can effectively deal with this problem.
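As a minimal sketch of this landmark detection step (assuming dlib's pretrained 68-point predictor file has been downloaded; the file path and helper name below are illustrative, not part of the paper), the cascaded regression detector can be invoked as follows:

```python
# Hedged sketch: 68-point landmark detection with the DLIB ERT cascade.
# "shape_predictor_68_face_landmarks.dat" must be downloaded separately.
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(image):
    """Return an (n_faces, 68, 2) array of landmark coordinates."""
    faces = detector(image, 1)          # upsample once to catch small faces
    landmarks = []
    for face in faces:
        shape = predictor(image, face)  # cascaded regression refines the shape
        pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])
        landmarks.append(pts)
    return np.array(landmarks)
```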

We train on 60K face images with face deflection angle data and feature point coordinate data from the 300W-LP dataset [18] to obtain a dictionary. Through the index dictionary, the output finds the index target that is closest to the deflection angle of the input face image. In addition, referring to the weight setting of the main facial components in PRNet, the feature points in the vicinity of the eyes, nose, and mouth are given greater weights to highlight the changes and recognition of the model:

$X = W \cdot P(I) \quad (2)$

where $P(I)$ detects the coordinates of the face feature points of image $I$ and $W$ is the weight. Figure 2 shows an example of feature point detection.

3.1.2. 3DMM Face Reconstruction

A rough face model with a smooth surface is close to the average face: it lacks fine facial detail but contains most of the depth information of the face. Fitting the model to the input face image moves the vertices of the BFM topology away from the average face model. The method in this paper employs BFM2017 [19] to fit a 3D face with less detail.

Taking the original image as input, assume that the mesh vertex coordinates of the 3D model are $S = (x_1, y_1, z_1, \ldots, x_n, y_n, z_n)^{T}$. The feature points obtained according to Equation (2) are used to calculate the PCA parameters. According to [6], the shape vector of the initial face model is

$S_{\text{model}} = \bar{S} + \sum_{i=1}^{m} \alpha_i s_i \quad (3)$

where $\alpha_i$ is the shape weight coefficient, $\bar{S}$ is the average face shape, and $s_i$ are the shape basis vectors.

According to the average face shape $\bar{S}$ obtained from a training set of 200 face scans, the differences between each face shape and the average shape are used to calculate the covariance matrix $C$ of the shape vectors:

$C = \frac{1}{m}\sum_{i=1}^{m}\big(S_i - \bar{S}\big)\big(S_i - \bar{S}\big)^{T} \quad (4)$

Through PCA, the orthogonal coordinate system formed by the eigenvectors $s_i$ of $C$ is used as the basis of the shape space.

Due to the universality of the main features of the human face, the distribution of the shape vector parameters $\alpha$ is assumed to be a normal distribution (as shown in Equation (5)):

$p(\alpha) \sim \exp\!\left(-\frac{1}{2}\sum_{i}\frac{\alpha_i^2}{\sigma_i^2}\right) \quad (5)$

where $\sigma_i^2$ are the eigenvalues of the shape covariance matrix. The texture parameters $\beta$ are distributed similarly to the shape parameters.
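A minimal NumPy sketch of how Equations (3)–(4) are used in practice: a face shape is assembled from the mean shape, the PCA basis, and the shape coefficients. The array names below are placeholders, not the actual field names of the BFM2017 files.

```python
# Minimal sketch of Equation (3): mean shape plus weighted PCA basis vectors.
import numpy as np

def build_shape(mean_shape, shape_basis, alpha):
    """
    mean_shape : (3n,)   mean face shape, flattened (x1, y1, z1, ..., xn, yn, zn)
    shape_basis: (3n, m) PCA eigenvectors, one column per principal component
    alpha      : (m,)    shape weight coefficients
    Returns the deformed shape as an (n, 3) vertex array.
    """
    shape = mean_shape + shape_basis @ alpha   # S = S_mean + sum_i alpha_i * s_i
    return shape.reshape(-1, 3)
```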

For the shape parameter $\alpha$, texture parameter $\beta$, and attribute (rendering) parameter $\rho$, the RGB vector of the projected image of the reconstruction model is

$I_{\text{model}}(x, y) = \big(I_r(x, y; \alpha, \beta, \rho),\ I_g(x, y; \alpha, \beta, \rho),\ I_b(x, y; \alpha, \beta, \rho)\big)^{T} \quad (6)$

The error between the projected image of the reconstructed model and the input image is

$E_I = \sum_{x, y}\left\| I_{\text{input}}(x, y) - I_{\text{model}}(x, y) \right\|^2 \quad (7)$

Matching the input face image with the 3D modeled face is an ill-posed problem. In the vector space of the face model, the matching quality and a prior can be used to obtain a constrained solution [6]. Similar to Equation (5), $\beta$ and $\rho$ obey normal distributions, and the prior of $\rho$ is obtained by the point-to-point method. According to Bayesian decision theory, the input image can be explained through the maximum posterior probability of the parameters $\alpha$, $\beta$, and $\rho$, and the model is reconstructed through these three parameters. However, under the influence of noise, the observed image is disturbed.

Assuming that $\sigma_N$ is the standard deviation of the Gaussian noise of the observed image, the probability of the observed image given the parameters is

$p\big(I_{\text{input}} \mid \alpha, \beta, \rho\big) \sim \exp\!\left(-\frac{1}{2\sigma_N^2}\, E_I\right) \quad (8)$

The posterior probability of the parameters is maximized by minimizing the cost function:

$E = \frac{1}{\sigma_N^2} E_I + \sum_i \frac{\alpha_i^2}{\sigma_{S, i}^2} + \sum_i \frac{\beta_i^2}{\sigma_{T, i}^2} + \sum_i \frac{(\rho_i - \bar{\rho}_i)^2}{\sigma_{R, i}^2} \quad (9)$

3.2. Camera Model
3.2.1. Weak Perspective Projection Function

To visualize a 3D model, the topology of the 3D model needs to be projected onto a two-dimensional plane. Compared with orthographic projection, perspective projection can freely set the reduction and enlargement of the projected image.

During the projection process, it may appear that the dense 3D coordinates are superimposed on the 2D coordinate points of the projection surface due to dimensionality reduction. Aiming at the projection of the pose face model, this paper uses a weak perspective projection function similar to the perspective projection function to deal with the problem of projecting a 3D model onto a 2D plane [20]. Figure 3 explains the difference between orthographic projection and weak perspective projection.

In this paper, assuming that the positive direction of the camera model is the $z$-axis and the weak perspective projection is performed in that direction, and referring to [21], we use the orthogonal projection matrix and the target displacement calibration to design the weak perspective projection function:

$V_{2d, i} = f \cdot Pr \cdot R \cdot S_i + t_{2d}, \qquad Pr = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \quad (10)$

In Equation (10), $f$ is the focal length ratio, $R$ is the rotation matrix, $S_i$ is the $i$-th vertex coordinate, and $t_{2d}$ is the displacement coefficient. The weak perspective projection function projects the normalized face mesh vertices from 3D space onto the 2D plane, which is convenient for subsequent operations and processing.
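A small NumPy sketch of the weak perspective projection in Equation (10); the variable names are illustrative only:

```python
# Sketch of Equation (10): scale, rotate, drop the depth axis, then translate.
import numpy as np

def weak_perspective_project(vertices, f, R, t2d):
    """
    vertices : (n, 3) 3D mesh vertices
    f        : scalar focal-length (scale) ratio
    R        : (3, 3) rotation matrix
    t2d      : (2,)   2D displacement
    Returns (n, 2) projected vertices.
    """
    Pr = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])           # orthogonal projection matrix
    return f * (vertices @ R.T) @ Pr.T + t2d    # V_2d = f * Pr * R * S + t_2d
```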

The error of projecting the initial PCA-based reconstruction model onto the plane is minimized [22]:

$E = \sum_{i=1}^{68} w_i \left\| x_i - \big(f \cdot Pr \cdot R \cdot S_i + t_{2d}\big) \right\|^2 + \lambda \lVert \alpha \rVert^2 \quad (11)$

where $x_i$ is the $i$-th feature point of the planar face, $S_i$ is the coordinate of the $i$-th vertex of the 3D model, $w_i$ is the weight of the $i$-th feature point, and $\lambda$ is the regularization coefficient of the shape parameter.

3.2.2. Hidden Surface Removal

In a dense 3D mesh under nonfrontal face conditions, some vertices always overlap, which affects the result and accuracy of feature point acquisition. In this paper, the Z-buffer algorithm [23] is used to resolve the ambiguity of the depth value.

The Z-buffer algorithm stores the depth value of the visible surface in the depth buffer, and the depth value of the hidden surface is removed, so a single view keeps only the depth of the visible surface. The depth value is not the true Euclidean distance in the Cartesian coordinate system but a relative measure of the distance from a vertex to the viewpoint. Assume that the model is viewed with the $z$-axis as the positive direction, so the projection surface is the $xy$ plane, and let $(x, y)$ be the coordinates of a pixel in the overlapping area of the projection surface. The ray parallel to the $z$-axis intersects the overlapping surfaces at the depth values $z_1$ and $z_2$, respectively, and the maximum of $(z_1, z_2)$ is stored in the Z-buffer.
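A minimal sketch of this idea, assuming vertices have already been projected to integer pixel coordinates (the function and argument names are illustrative):

```python
# Sketch of the Z-buffer step: when several projected vertices fall on the same
# pixel, keep only the depth closest to the viewer (the maximum z, matching the
# convention in the text above).
import numpy as np

def zbuffer_depth_map(proj_xy, depths, height, width):
    """
    proj_xy : (n, 2) integer pixel coordinates of projected vertices
    depths  : (n,)   per-vertex depth values
    Returns an (height, width) depth map with hidden depths removed.
    """
    zbuf = np.full((height, width), -np.inf)
    for (x, y), z in zip(proj_xy, depths):
        if 0 <= x < width and 0 <= y < height and z > zbuf[y, x]:
            zbuf[y, x] = z                      # visible surface wins
    return zbuf
```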

Figure 4 demonstrates the difference between the depth maps obtained with and without the Z-buffer. The depth map computed with the Z-buffer algorithm is distinct, and there is no ambiguity in the depth values caused by pose self-occlusion under a single viewpoint.

3.3. Face Alignment Network

The purpose of face alignment is to obtain a dictionary through training. The input face image is indexed after face detection, and then the angle of the target face relative to the frontal view is obtained so that the target face can be aligned [24]. Face alignment obtains the angle of the target face using the feature point detection of the DLIB library and the improved feature point loss function. When the input is a face image with a large pose, not only can the visible feature points be detected accurately, but the feature points that are invisible due to pose self-occlusion can also be predicted more accurately.

Each image in the test set is tested $n$ times, and the average of its location map feature points is taken. The improved loss function $L$ is

$L = \sum_{i=1}^{68} W_i \left\| \bar{P}_i - \tilde{P}_i \right\| \quad (12)$

where $\bar{P}_i$ is the average over $n$ tests of the $i$-th feature point landmark of the location map, $\tilde{P}_i$ is the real landmark, and $W_i$ is the weight of the feature point.
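A hedged PyTorch sketch of a weighted landmark loss in the spirit of Equation (12); the weight vector (larger values around eyes, nose, and mouth) is an assumption for illustration:

```python
# Sketch of a weighted landmark loss as in Equation (12).
import torch

def weighted_landmark_loss(pred_landmarks, true_landmarks, weights):
    """
    pred_landmarks : (B, 68, 2) averaged predicted landmarks
    true_landmarks : (B, 68, 2) ground-truth landmarks
    weights        : (68,)      per-landmark weights
    """
    per_point = torch.norm(pred_landmarks - true_landmarks, dim=-1)  # (B, 68)
    return (per_point * weights).mean()
```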

The face alignment network is a CNN architecture based on the residual network [25, 26], composed of 10 residual modules. Figure 5 is a diagram of the face alignment network structure.

When training the face alignment network, the annotated angle of each training image, the 3D point cloud, and additional parameters are used as the training targets, and the projected normalized coordinate code (PNCC) feature [18], which can represent the shape of the model, is used to generate the dictionary.

PNCC is composed of the normalized coordinate code (NCC) and the Z-buffer algorithm. NCC normalizes the vertex coordinates of the 3D average face model, and its calculation formula is

$NCC_d = \frac{\bar{S}_d - \min\big(\bar{S}_d\big)}{\max\big(\bar{S}_d\big) - \min\big(\bar{S}_d\big)}, \qquad d \in \{x, y, z\} \quad (13)$

The purpose of PNCC is to use the Z-buffer algorithm to remove the hidden surfaces of the model colored by NCC, achieving the effect of projection. The PNCC calculation formula is

$PNCC = Z\text{-}Buffer\big(V_{3d}(p),\, NCC\big) \quad (14)$

where $V_{3d}(p)$ is the projected 3D surface and $p$ is a model parameter.
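A minimal sketch of the NCC normalization in Equation (13); the function name and array layout are assumptions:

```python
# Sketch of Equation (13): min-max normalize the mean-face vertex coordinates
# per axis so they can double as per-vertex colors when rendering PNCC.
import numpy as np

def ncc(mean_shape_vertices):
    """mean_shape_vertices: (n, 3) vertices of the 3D average face model."""
    v = mean_shape_vertices
    return (v - v.min(axis=0)) / (v.max(axis=0) - v.min(axis=0))  # each axis in [0, 1]
```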

4. Face Detail Synthesis Module

In the initial face reconstruction module, although the PCA-based 3DMM reconstruction model contains most of the information of the reconstruction target, it loses part of the detailed information due to dimensionality reduction. We use a face detail synthesis module to make up for the missing facial detail information.

4.1. Displacement Map Based on Texture Bump

The details of the face include gullies and wrinkles, so it is difficult to detect and extract them with a single unified method. Integrating undifferentiated detection and detail extraction can effectively solve this problem. We use a deep learning method to build a detail synthesis network, which detects the face in the image, extracts the texture map of the face area, and synthesizes a displacement map based on the texture map.

The displacement map is similar to the normal map. A normal map highlights the unevenness of the model: it represents the normal vector corresponding to each vertex but cannot change the vertex coordinates of the model itself, so all the details are reflected only in the map. The displacement map, in contrast, can use micropolygon tessellation [27] to change the details of the model surface. For a 3DMM composed of triangular meshes, first, a triangular structure with the same size as the image pixels is tessellated on the effective area of the model. The bump map is converted to gray scale, and the depth coordinate is determined by the gray level. Then, according to the triangular mesh obtained by tessellation, the vertices are moved along the original surface normal direction. Finally, a new normal vector is determined for each new mesh vertex.
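A hedged sketch of the vertex displacement step just described: each mesh vertex is moved along its normal by an amount read from the gray-scale displacement map. The UV lookup and the scale factor are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch: move each vertex along its normal by the sampled displacement value.
import numpy as np

def apply_displacement(vertices, normals, uvs, disp_map, scale=1.0):
    """
    vertices : (n, 3) mesh vertices       normals : (n, 3) unit vertex normals
    uvs      : (n, 2) per-vertex UVs in [0, 1]
    disp_map : (H, W) gray-scale displacement map in [0, 1]
    """
    h, w = disp_map.shape
    px = np.clip((uvs[:, 0] * (w - 1)).astype(int), 0, w - 1)
    py = np.clip((uvs[:, 1] * (h - 1)).astype(int), 0, h - 1)
    offsets = disp_map[py, px] * scale           # per-vertex displacement amount
    return vertices + normals * offsets[:, None]
```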

The $x$ and $y$ components of the model's three-dimensional coordinates are represented by the texture coordinates, i.e., the image color, while the $z$ coordinate is represented by the gray scale of the displacement map. The depth information of the shifted texture obtained by simply graying the texture is incomplete. The reason is that, for face images, some facial details may be treated as noise, or the depth of some details is too similar to that of the main facial area, resulting in a large deviation of the model.

Our method proposes a detail synthesis network based on the gray-scale displacement map, and the subtle details of the face are treated as noise so that the generator can extract the details of the texture map that are difficult to handle. The extracted detail noise is used as a feature map and cyclically synthesized into a displacement map. According to the gray value, the depth of the model is changed by a small amount to highlight the details. The synthesized texture has a high pixel resolution, so more pixels correspond to the details, which is more convenient for processing. The three images in Figure 6 are the RGB texture map, the normal map, and the displacement map.

In Figure 7, the red-framed areas of the detail model reconstructed by our method mark three details ranging from small to large depth. After rendering the displacement map onto the model, it can be clearly seen that details of different depths fit the surface relatively well.

4.2. Facial Expression Processing

The recognition and fitting of facial expressions is a key problem to be solved in the field of 3D face reconstruction. The dynamic changes and intensity of facial expressions affect the analysis of the principal components of the face. Because dimensionality reduction from 3D space loses part of the information, the facial expression model becomes ambiguous when projected onto the 2D plane.

Our method mainly uses the expression fitting function of BFM2017 to realize the dynamic changes of the face. According to Equation (3), an additional expression vector $e$ is added on the basis of the neutral-expression face shape vector, i.e.,

$S_{\text{exp}} = \bar{S} + \sum_{i=1}^{m} \alpha_i s_i + e \quad (15)$

However, the expression fitting function of BFM2017 mainly changes the mouth vector, and its fitting effect on other facial parts is not ideal. Therefore, this article uses a semantically defined emotion feature predictor together with physical appearance features. The emotion feature predictor is trained with deep learning to obtain the corresponding expression parameters, and the appearance features come from the expression fitting of BFM2017.

Referring to the processing of dynamic facial expressions in DFDN, the emotion feature predictor is trained on a total of 450K images with 11 expression categories from the AffectNet dataset [28]. The feature vector used to represent human emotion is obtained by training a CNN, and the emotion parameters are randomly generated from a standard normal distribution. The emotion feature vectors with expression parameters are used to render an emotion image set, and this training set is input to the emotion feature predictor to obtain the feature vectors of the face objects in the image set [22]. The emotion feature vector is combined with the physical appearance features to obtain a semantically defined feature vector.

According to the one-to-one correspondence between the feature vectors of the image set and the expression parameters, a dictionary is built to represent the mapping from feature vectors to expression parameters. Given an input facial expression image, its emotion feature vector is obtained through the emotion feature predictor, the dictionary is traversed, and the expression parameter closest to this vector is found.
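A minimal sketch of this nearest-neighbor dictionary lookup; the variable names are illustrative and the distance metric (Euclidean) is an assumption, since the text only states "closest":

```python
# Sketch: map an emotion feature vector to the expression parameters of its
# nearest neighbor in the training dictionary.
import numpy as np

def lookup_expression(query_feature, feature_bank, expression_params):
    """
    query_feature    : (d,)    emotion feature of the input image
    feature_bank     : (N, d)  emotion features of the training images
    expression_params: (N, k)  expression parameters paired with feature_bank
    """
    dists = np.linalg.norm(feature_bank - query_feature, axis=1)
    return expression_params[np.argmin(dists)]   # closest entry in the dictionary
```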

4.3. GAN-Based Detail Synthesis Network

The Conditional Generative Adversarial Network (CGAN) [29], based on GAN, is divided into two parts: a generator network and a discriminator network. The generator randomly generates constrained images, and the generated images pass through the discriminator, which performs feature threshold discrimination and keeps valid features; the generation-judgment process is repeated until the discriminator can no longer tell the generated images apart.

In this article, when dealing with 3D face models, the loss function of the CGAN is

$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x, y}\big[\log D(x, y)\big] + \mathbb{E}_{x, z}\big[\log\big(1 - D(x, G(x, z))\big)\big] \quad (16)$

where $x$ is the input image, $y$ is the feature point, and $z$ is random noise. Referring to [30], the objective is optimized as in Equation (17):

$G^{*} = \arg\min_{G}\max_{D}\ \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G) \quad (17)$

where $G$ is the generator, $D$ is the discriminator, $\mathcal{L}_{L1}(G)$ is the generator's $L1$ loss, and $\lambda$ is set to 100.
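A hedged PyTorch sketch of a pix2pix-style objective in the spirit of Equations (16)–(17), with the adversarial term plus an L1 term weighted by lambda = 100. Here the discriminator is assumed to be conditioned on the input image and to judge real versus generated displacement maps; the generator and discriminator modules are assumed to be defined elsewhere.

```python
# Sketch of the objective in Equations (16)-(17), pix2pix-style.
import torch
import torch.nn.functional as F

LAMBDA_L1 = 100.0

def generator_loss(discriminator, x, y_real, y_fake):
    """x: input image, y_real: real displacement map, y_fake: G(x, z)."""
    d_fake = discriminator(x, y_fake)
    adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    l1 = F.l1_loss(y_fake, y_real)
    return adv + LAMBDA_L1 * l1

def discriminator_loss(discriminator, x, y_real, y_fake):
    d_real = discriminator(x, y_real)
    d_fake = discriminator(x, y_fake.detach())
    real = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    fake = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return 0.5 * (real + fake)
```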

The U-net model, based on an improved FCN [31], is a structure including down-sampling and up-sampling, with the purpose of increasing the accuracy of the generated image. Down-sampling is used to extract contextual information, and up-sampling combines the contextual information from down-sampling with the input information of up-sampling to restore detailed information, making the texture of the human face more realistic.

This network uses the U-net-6 structure and takes the original target image as input to generate displacement maps from the semantically defined texture structure map. The generator network and a 4-layer fully connected layer constrain the generated data through feature points and calculate the PCA parameters. Except for the fully connected layers, every linear part is activated by the ReLU function, while the LeakyReLU function is used between the fully connected layers. The structure of the U-net-6 generator is shown in Figure 8.

The network discriminator judges the validity of the output image through a threshold. In this paper, the discriminator is based on PatchGAN [32]. The input image is divided into an $N \times N$ matrix of patches, and after convolution an $N \times N$ matrix is output. The output matrix is averaged, the threshold is applied, and a logical result is output. The network structure of the discriminator is shown in Figure 9.
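A minimal PatchGAN-style discriminator sketch that produces an N x N matrix of patch logits whose mean can then be thresholded, as described above. The layer sizes and input channels (RGB image plus a gray-scale displacement map) are assumptions, not the exact configuration used in the paper.

```python
# Hedged PatchGAN-style discriminator sketch (layer sizes are illustrative).
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_channels=4):        # RGB image + gray displacement map
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),  # N x N patch logits
        )

    def forward(self, image, generated_map):
        x = torch.cat([image, generated_map], dim=1)
        return self.net(x)
```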

5. Experiment and Discussion

5.1. Face Alignment Evaluation

The visible and invisible feature points of nonfrontal faces obtained by face alignment directly affect the subsequent initial face reconstruction. In our evaluation experiment, the normalized mean error (NME), calculated by comparing the detected feature points with the real landmarks, represents the accuracy of the feature points.
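A small sketch of an NME computation; normalizing by the bounding-box size is an assumption, since the text does not state the normalization factor used:

```python
# Sketch of the normalized mean error (NME) over 68 landmarks.
import numpy as np

def nme(pred, gt):
    """pred, gt: (68, 2) detected and ground-truth landmarks."""
    bbox = gt.max(axis=0) - gt.min(axis=0)
    norm = np.sqrt(bbox[0] * bbox[1])                    # bounding-box size
    return np.mean(np.linalg.norm(pred - gt, axis=1)) / norm
```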

For the face alignment experiment, this article uses the 300W-LP dataset as the training set. The dataset contains faces with deflection angles from 0 to 90 degrees, with a total of more than 60K images. The DLIB library is used to detect the faces and crop each image to a face image.

To evaluate the accuracy of feature points for face poses at different angles, 1000 images are randomly selected from the 300W-LP dataset. The average normalized mean error (NME) between the 68 detected face feature points and the real landmarks is calculated to evaluate accuracy. In addition, we compare our method with two other advanced face alignment methods, PRNet and 3DDFA. The results are shown in Figure 10.

According to Figure 10, compared with the other two methods, our method obtains better results in the feature point detection experiment on the 300W-LP sample set.

5.2. Reconstruction Evaluation in Constrained Scenarios

For the evaluation of face image reconstruction in constrained scenes, this experiment uses the Facescape dataset [33]. For the evaluation of the 3D model [34], the experiments in this paper are based on the root mean square error (RMSE) and standard deviation (SD) between the point cloud of the reconstructed model and the ground truth. RMSE is used to evaluate the accuracy of the reconstructed model, and SD is used to assess the degree of dispersion of the point cloud of the reconstructed model itself. In the reconstruction evaluation, the accuracy values of the neutral face evaluation, the facial expression evaluation, and the robustness evaluation are denoted RMSE 1, RMSE 2, and RMSE 3, respectively. The lower the RMSE and SD, the better the accuracy and dispersion of the reconstructed model.
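A minimal sketch of these two metrics; point-to-point correspondence between the reconstructed and ground-truth clouds is assumed, since the paper does not detail how correspondences are established:

```python
# Sketch of the reconstruction metrics: RMSE against ground truth and SD of
# the reconstructed point cloud around its centroid.
import numpy as np

def rmse(recon_points, gt_points):
    """recon_points, gt_points: (n, 3) corresponding point clouds."""
    return np.sqrt(np.mean(np.sum((recon_points - gt_points) ** 2, axis=1)))

def point_cloud_sd(recon_points):
    """Dispersion of the reconstructed cloud around its centroid."""
    centered = recon_points - recon_points.mean(axis=0)
    return np.std(np.linalg.norm(centered, axis=1))
```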

5.2.1. Frontal Face Model Evaluation

In this experiment, the accuracy (RMSE 1) and the dispersion (SD) of the frontal face reconstruction model are used as the evaluation standard. Ten frontal face images of subjects were randomly selected from the Facescape dataset as test set 1, the test set 1 images were reconstructed through the integrated network proposed in this paper, and 10 sets of models were obtained. In addition, this experiment compares our method with three other advanced algorithms: PRNet [9], 3DDFA [18], and RingNet [14].

According to the evaluation standard, the mean RMSE 1 and SD of the 10 groups of reconstruction models are calculated. The data of test set 1 are shown in Table 1. The detailed data of our method and the models reconstructed by PRNet, 3DDFA, and RingNet are shown in Figure 11.

Based on the above comparison, our method achieves higher accuracy and lower dispersion than the other three methods when reconstructing frontal neutral face images. Figure 12 shows examples of the heat distribution of the reconstruction error for the sample models.

5.2.2. Frontal Face Model with Expression Evaluation

The difficulty of facial expression reconstruction is often greater than that of neutral-expression face reconstruction. More reconstruction models of images in unconstrained environments are shown in Figure 13.

In this experiment, the Facescape dataset was used to evaluate the reconstruction of facial expressions. The Facescape dataset contains depth information for 20 dynamic facial expressions of each collected subject. Eight dynamic facial expression images of subjects are randomly selected from the Facescape dataset for reconstruction, and the root mean square error (RMSE 2) is calculated (as shown in Figure 14).

For the same method, the accuracy of a facial expression reconstruction model is often slightly lower than that of a neutral face reconstruction model. In Figure 14, although the RMSE 2 of the expression face models reconstructed by our method is higher than our mean RMSE 1, it is lower than the mean RMSE 1 of the neutral face models reconstructed by the other methods. In other words, the accuracy of our method for facial expression reconstruction is significantly higher than that of the neutral face reconstruction models of the other comparison methods, so our method also has advantages in facial expression fitting.

5.2.3. Robustness Evaluation under Noise Environment

In the field of 3D reconstruction, robustness is an important evaluation criterion for reconstruction algorithms. It indicates the degree to which an algorithm adapts to a complex environment and whether it can reduce the influence of interference factors on model reconstruction. The robustness evaluation in this paper mainly concerns face reconstruction in a noisy environment. First, 6 images are randomly selected from the Facescape dataset, and Gaussian noise and salt-and-pepper noise are applied to these 6 images, respectively. Figure 15 shows an example comparing the detail reconstruction model of the original image with the detail reconstruction model of the noisy image.

The images with applied noise form test set 2. Then, the original images and the noisy images of test set 2 were reconstructed through the integrated network, compared with the ground truth, and the root mean square error (RMSE 3) was calculated (as shown in Figure 16).

According to test set 2 of the noise evaluation experiment and the corresponding noisy images, the RMSE 3 difference between the noisy image reconstruction and the original image reconstruction fluctuates within the interval (-0.04, 0.18). In addition, a large number of noise points may cover the high-frequency details, which affects the discrimination process of the discriminator in the face detail synthesis module, resulting in more iterations and a slight reduction in the accuracy of the whole model.

6. Conclusion

We propose a reconstruction system for face models. The initial face reconstruction module uses a face alignment network and 3DMM to initially reconstruct a face with a smooth surface. The face detail synthesis network generates a displacement map that contains most of the details of the reconstructed object. For facial expressions, we use an emotion feature predictor to fit the expressions. The three-dimensional appearance and accuracy of the detailed face model are better than those of the PCA-based 3DMM reconstruction model. Through the evaluation of face alignment, accuracy, and robustness in unconstrained scenes, our method obtains ideal results. Compared with other advanced methods, our method also has clear advantages.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the Shandong Provincial Natural Science Foundation (ZR2020MF119).