Neural Networks

Volume 134, February 2021, Pages 86-94

2020 Special Issue
Generating photo-realistic training data to improve face recognition accuracy

https://doi.org/10.1016/j.neunet.2020.11.008

Highlights

  • Generate photo-realistic face images using a conditional GAN.

  • Two latent vectors encode identity-related and non-identity-related attributes respectively.

  • Map discrete identity labels to identity features in a continuous latent space.

  • Training sets with a better balance between real and synthetic images outperformed less balanced ones.

Abstract

Face recognition has become a widely adopted biometric in forensics, security and law enforcement thanks to the high accuracy achieved by systems based on convolutional neural networks (CNNs). However, to achieve good performance, CNNs need to be trained with very large datasets which are not always available. In this paper we investigate the feasibility of using synthetic data to augment face datasets. In particular, we propose a novel generative adversarial network (GAN) that can disentangle identity-related attributes from non-identity-related attributes. This is done by training an embedding network that maps discrete identity labels to an identity latent space that follows a simple prior distribution, and training a GAN conditioned on samples from that distribution. A main novelty of our approach is the ability to generate both synthetic images of subjects in the training set and synthetic images of new subjects not in the training set, both of which we use to augment face datasets. By using recent advances in GAN training, we show that the synthetic images generated by our model are photo-realistic, and that training with datasets augmented with those images can lead to increased recognition accuracy. Experimental results show that our method is more effective when augmenting small datasets. In particular, an absolute accuracy improvement of 8.42% was achieved when augmenting a dataset of less than 60k facial images.

Introduction

Recent progress in machine learning has made possible the development of face recognition systems that can match face images as well as or better than humans. For this reason, these systems have become a valuable tool in forensics and security. However, state-of-the-art face recognition systems based on convolutional neural networks (CNNs) need to be trained with very large datasets of face images. In this work we aim to reduce the data requirements of face recognition systems by synthesising artificial face images.

Image synthesis is a widely studied topic in computer vision. In particular, face image synthesis has gained a lot of attention because of its diverse practical applications. These include facial image editing (Antipov et al., 2017, Brock et al., 2017, Choi et al., 2018, Lample et al., 2017, Larsen et al., 2016, Perarnau et al., 2016, Shu et al., 2017, Yan et al., 2016, Zhang et al., 2017), face de-identification (Brkic et al., 2017, Meden et al., 2018, Meden et al., 2017, Wu et al., 2019), data augmentation (Banerjee et al., 2017, Kortylewski et al., 2018, Masi et al., 2019, Masi et al., 2016, Mokhayeri et al., 2018, Osadchy et al., 2017, Zhao et al., 2019), face frontalisation (Hassner et al., 2015, Huang et al., 2017, Tran et al., 2019, Yim et al., 2015, Zhu et al., 2015, Zhu et al., 2013, Zhu et al., 2014) and artistic applications (e.g. video games and advertisements).

In this work, we focus on the applicability of face image synthesis for data augmentation. It is widely known that training data is one of the most important factors that affect the accuracy of deep learning models. The datasets used for training need to be large and contain sufficient variation to allow the resulting models to learn features that generalise well to unseen samples. In the case of face recognition, the datasets must contain many different subjects, as well as many different images per subject. The first requirement enables a model to learn inter-class discriminative features that can generalise to subjects not in the training set. The second requirement enables a model to learn features that are robust to intra-class variations. Even though there are several public large-scale datasets (Bansal et al., 2017, Cao et al., 2018, Guo et al., 2016, Nech and Kemelmacher-Shlizerman, 2017, Parkhi et al., 2015, Sun et al., 2014, Yi et al., 2014) that can be used to train CNN-based face recognition models, these datasets are nowhere near the size or quality of commercial datasets. For example, the largest publicly available dataset contains about 10M images of 100K different subjects (Guo et al. 2016), whereas Google’s FaceNet (Schroff et al. 2015) was trained with a private dataset containing between 100M and 200M face images of about 8M different subjects. Another issue is the presence of long-tail distributions in some publicly available datasets, i.e. datasets in which there are many subjects with very few images. Such unbalanced datasets can make the training process difficult and result in models that achieve lower accuracy than those trained with smaller but balanced datasets (Zhao et al., 2019, Zhou et al., 2015). In addition, some publicly available datasets (e.g. Guo et al. 2016) contain many mislabelled samples that can decrease face recognition accuracy if not discarded from the training set. Since collecting large-scale, good quality face datasets is a very expensive and labour-intensive task, we propose a method for generating photo-realistic face images that can be used to effectively increase the depth (number of images per subject) and width (number of subjects) of existing face datasets.

An approach that has recently gained popularity for augmenting face datasets is the use of 3D morphable models (Blanz and Vetter 2003). In this approach, new faces of existing subjects can be synthesised by fitting a 3D morphable model to existing images and modifying a variety of parameters to generate new poses and expressions (Masi et al., 2019, Masi et al., 2016, Mokhayeri et al., 2018, Zhao et al., 2019). It is also possible to generate images with other variations using this approach. For example, Mokhayeri et al. (2018) incorporated a reflectance model to generate images under different lighting conditions; and Kortylewski et al. (2018) randomly sampled 3D face shapes and colours to generate faces of new subjects. The main drawback of methods based on 3D morphable models is that the generated images often look unnatural and lack the level of detail found in real images. Another recent approach based on blending small triangular regions from different training images was proposed in Banerjee et al. (2017). Although this method seemed to produce photo-realistic faces, the authors limited their work to frontal face images. In contrast, our approach makes use of generative adversarial networks (GANs) (Goodfellow et al., 2014, Schmidhuber, 2020), which have recently been shown to produce photo-realistic in-the-wild images often indistinguishable from real images (Karras et al. 2018). Another advantage of using GANs is that they are end-to-end trainable models that do not require any domain-specific processing, as opposed to methods based on 3D modelling or face triangulation.

Many methods based on GANs have been proposed for manipulating attributes of existing face images, including age (Antipov et al., 2017, Yan et al., 2016, Zhang et al., 2017), facial expressions (Choi et al., 2018, Ding et al., 2018, Yan et al., 2016, Zhou and Shi, 2017), and other attributes such as hairstyle, glasses, makeup, facial hair, skin colour or gender (Brock et al., 2017, Choi et al., 2018, He et al., 2017, Lample et al., 2017, Larsen et al., 2016, Lu et al., 2018, Perarnau et al., 2016, Shen and Liu, 2017, Shu et al., 2017). While these methods can be used to increase the depth of a dataset, it remains unclear how they could increase the width of a dataset, i.e. how to generate faces of new subjects. Our proposed GAN generates faces from a latent representation z with two Gaussian-distributed components, z_id and z_nid, encoding identity-related and non-identity-related attributes respectively. In this way, face images of a new subject can be generated by sampling a new identity component z_id and holding it fixed while varying the non-identity component z_nid. The method most closely related to ours is the semantically decomposed GAN (SD-GAN) proposed in Donahue et al. (2018). SD-GANs are trained with pairs of real images from the same subject and pairs of images generated with the same identity-related attributes but different non-identity-related attributes. A discriminator learns to reject pairs of images that either do not look photo-realistic or do not appear to belong to the same subject. A key difference between our method and SD-GANs is that ours also allows the generation of face images of subjects that already exist in the training set. In other words, our method can increase both the width and the depth of a given face dataset. Furthermore, our proposed GAN is arguably simpler to implement than SD-GAN and easier to incorporate into other GAN architectures.
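To make this sampling scheme concrete, the sketch below shows how images of a new subject could be drawn from such a disentangled latent space. It is a minimal illustration, assuming a trained generator G that takes the two latent components as inputs; the function name and latent dimensions are hypothetical placeholders, not the authors' implementation.

```python
import numpy as np

def sample_images_of_new_subject(G, dim_id=128, dim_nid=128, n_images=8, rng=None):
    """Hypothetical helper: draw several images of one synthetic subject."""
    rng = rng or np.random.default_rng()
    z_id = rng.standard_normal(dim_id)        # sample a new identity, then keep it fixed
    images = []
    for _ in range(n_images):
        z_nid = rng.standard_normal(dim_nid)  # vary pose, expression, lighting, etc.
        images.append(G(z_id, z_nid))         # same subject, different non-identity attributes
    return images
```

Generating additional images of a subject already in the training set would work the same way, except that z_id would come from the learned identity embedding rather than from the prior.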

To demonstrate the efficacy of our method, we trained several CNN-based face recognition models with different combinations of real and synthetic data. In most cases, the models trained with a combination of real and synthetic data outperformed the models trained with real data alone.

Our main contributions can be summarised as:

  • A novel face image synthesis method based on GANs that allows the disentangling of identity-related attributes from non-identity-related attributes.

  • A data augmentation approach that uses the proposed GAN to increase the depth and width of existing face datasets.

  • An experimental demonstration that the proposed data augmentation approach can increase the accuracy of a face recognition algorithm beyond that achieved by training with real images alone.

The rest of this paper is organised as follows: Section 2 provides the background needed to understand our proposed GAN. Section 3 explains each part of our proposed GAN and the loss functions used for training. Section 4 discusses our experimental results, both in terms of the quality of the synthetic images generated by our proposed GAN and the accuracy achieved by models trained with datasets augmented with those images. Finally, our conclusions are presented in Section 5.

Background

Generative adversarial networks (GANs) generate data by sampling from a probability distribution p_model that is trained to match a true data-generating distribution p_data. This is done by mapping a vector of random latent variables z ∼ p_z to a sample G(z) through a generator network G, where p_z is a prior distribution that can be easily sampled (e.g. Gaussian or uniform). The generator is trained to fool a discriminator network D that tries to determine whether a sample is real or generated.
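For reference, this adversarial game corresponds to the standard minimax objective of Goodfellow et al. (2014):

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]$$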

Proposed method

In this section, we first explain our choice of GAN architecture and type of conditional GAN, and then our proposed modifications to disentangle identity-related attributes from non-identity-related attributes. Finally, we explain how we use the proposed GAN for augmenting existing face datasets.
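As a rough illustration of the identity conditioning described in the abstract (an embedding network mapping discrete identity labels to a continuous identity latent space), the following is a minimal sketch, not the authors' code; module names and dimensions are assumptions, and the loss that pushes the embeddings towards the prior distribution is omitted.

```python
import torch
import torch.nn as nn

class IdentityEmbedding(nn.Module):
    """Maps discrete identity labels to points z_id in a continuous latent space."""
    def __init__(self, num_identities: int, dim_id: int = 128):
        super().__init__()
        self.table = nn.Embedding(num_identities, dim_id)

    def forward(self, labels: torch.Tensor) -> torch.Tensor:
        return self.table(labels)

embed = IdentityEmbedding(num_identities=10_000)
z_id = embed(torch.tensor([3, 7]))         # identity codes for two training subjects
z_nid = torch.randn(2, 128)                # non-identity codes sampled from the prior
g_input = torch.cat([z_id, z_nid], dim=1)  # conditioning input for the generator
```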

Experiments

In this section, we start by providing a qualitative analysis of the synthetic images generated by our proposed GAN. Next, we explore the feasibility of augmenting face datasets with synthetic images, both in terms of width and depth. The augmented datasets are used to train CNN-based face recognition models (henceforth referred to as discriminative models) to determine whether they achieve a higher accuracy than models trained with real images alone.
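The augmentation step itself amounts to merging real and synthetic labelled images into a single training set. A minimal sketch, with placeholder tensors standing in for the actual datasets (shapes and label ranges are illustrative, not the paper's configuration):

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

# Placeholder data: real images of 100 subjects plus synthetic images of
# 50 new subjects whose labels extend the real label space (width augmentation).
real = TensorDataset(torch.randn(1000, 3, 112, 112), torch.randint(0, 100, (1000,)))
synthetic = TensorDataset(torch.randn(500, 3, 112, 112), torch.randint(100, 150, (500,)))

train_set = ConcatDataset([real, synthetic])
loader = DataLoader(train_set, batch_size=256, shuffle=True)  # feeds the discriminative model
```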

We use the Face-Resnet architecture as our discriminative model.

Conclusions

In this paper, we have studied the feasibility of augmenting face datasets with photo-realistic synthetic images. In particular, we have presented a new type of conditional GAN that can generate photo-realistic face images from two latent vectors encoding identity-related attributes and non-identity-related attributes respectively. By fixing the latent vector of identity-related attributes and varying the latent vector of non-identity-related attributes, our proposed GAN can generate multiple images of the same subject.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This research did not receive any specific grant from funding agencies in public, commercial, or not-for-profit sectors.

References (63)

  • Schmidhuber, J. (2020). Generative Adversarial Networks are special cases of Artificial Curiosity (1990) and also closely related to Predictability Minimization (1991). Neural Networks.
  • Antipov, G., Baccouche, M., & Dugelay, J.-L. (2017). Face aging with conditional generative adversarial networks. In...
  • Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In Proc. int. conf. mach....
  • Banerjee, S., Bernhard, J. S., Scheirer, W. J., Bowyer, K. W., & Flynn, P. J. (2017). SREFI: Synthesis of realistic...
  • Bansal, A., Nanduri, A., Castillo, C. D., Ranjan, R., & Chellappa, R. (2017). UMDFaces: An annotated face dataset for...
  • Blanz, V., & Vetter, T. (2003). Face recognition based on fitting a 3D morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Brkic, K., Sikiric, I., Hrkac, T., & Kalafatic, Z. (2017). I know that person: generative full body and face...
  • Brock, A., Donahue, J., & Simonyan, K. (2018). Large scale GAN training for high fidelity natural image synthesis. In...
  • Brock, A., Lim, T., Ritchie, J. M., & Weston, N. (2017). Neural photo editing with introspective adversarial networks....
  • Cao, Q., Shen, L., Xie, W., Parkhi, O. M., & Zisserman, A. (2018). VGGFace2: A dataset for recognising faces across...
  • Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable...
  • Choi, Y., Choi, M., Kim, M., Ha, J.-W., Kim, S., & Choo, J. (2018). StarGAN: Unified generative adversarial networks...
  • Ding, H., Sricharan, K., & Chellappa, R. (2018). ExprGAN: Facial expression editing with controllable expression...
  • Donahue, C., et al. (2018). Semantically decomposing the latent spaces of generative adversarial networks.
  • Goodfellow, I. (2016). NIPS 2016 tutorial: Generative adversarial networks.
  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014)....
  • Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. C. (2017). Improved training of Wasserstein GANs....
  • Guo, Y., Zhang, L., Hu, Y., He, X., & Gao, J. (2016). MS-Celeb-1M: A dataset and benchmark for large-scale face...
  • Hasnat, A., Bohné, J., Milgram, J., Gentric, S., & Chen, L. (2017). DeepVisage: Making face recognition simple yet with...
  • Hassner, T., Harel, S., Paz, E., & Enbar, R. (2015). Effective face frontalization in unconstrained images. In Proc....
  • He, Z., et al. (2017). AttGAN: Facial attribute editing: Only change what you want.
  • Huang, R., Zhang, S., Li, T., & He, R. (2017). Beyond face rotation: Global and local perception GAN for photorealistic...
  • Karras, T., et al. (2018). Progressive growing of GANs for improved quality, stability, and variation.
  • Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In...
  • Kingma, D. P., et al. Auto-encoding variational Bayes.
  • Klare, B. F., Klein, B., Taborsky, E., Blanton, A., Cheney, J., Allen, K., Grother, P., Mah, A., & Jain, A. K. (2015)....
  • Kortylewski, A., et al. (2018). Training deep face recognition systems with synthetic data.
  • Lample, G., Zeghidour, N., Usunier, N., Bordes, A., & Denoyer, L. (2017). Fader networks: Manipulating images by...
  • Larsen, A. B. L., Sønderby, S. K., Larochelle, H., & Winther, O. (2016). Autoencoding beyond pixels using a learned...
  • Lu, Y., Tai, Y.-W., & Tang, C.-K. (2018). Conditional CycleGAN for attribute guided face image generation. In Proc....
  • Makhzani, A., et al. Adversarial autoencoders.