Pattern Recognition

Volume 118, October 2021, 108001

Controllable Image Generation with Semi-supervised Deep Learning and Deformable-Mean-Template Based Geometry-Appearance Disentanglement

https://doi.org/10.1016/j.patcog.2021.108001

Highlights

  • Generative controllable neural-net model by explicitly disentangling geometry and appearance.

  • Learn geometry variability using population-mean template and per-individual deformations.

  • Learn appearance variability in image space designed to factor out geometric variability.

  • Semi-supervised variational learning with limited manually-annotated attributes.

  • Empirical analysis on two large public datasets, comparing several existing methods.

Abstract

Typical deep-neural-network (DNN) based generative image models often (i) show limited ability to learn a disentangled latent representation, (ii) show limited controllability, leading to undesirable side effects when manipulating selected attributes during image generation, and (iii) require large attribute-annotated training sets. We propose a generative DNN model for face images by explicitly disentangling geometry and appearance modeling to achieve selective controllability of the desired attributes with fewer side effects. To learn geometric variability, we leverage grayscale sketch representations to learn (i) a deformable mean template representing the population-mean face geometry and (ii) a generative model of deformations to model individual face-geometry variations, using dense image registration. We learn the appearance variability in a (color-image) space that we explicitly design by factoring out the geometric variability. We propose a variational formulation to enable semi-supervised learning when manually-annotated attributes are severely limited in the training set. Results on large datasets show that, compared to schemes using deformation models or variational learning, our method significantly improves face-image model fits and facial-feature controllability even with semi-supervised learning.

Introduction

Generative modeling and representation learning are important in many fields including image analysis where deep neural networks (DNNs) can learn highly nonlinear mappings to transform low-dimensional parametric distributions to complex high-dimensional distributions in the space of images. Prevalent generative models either use variational learning, e.g., variational autoencoders (VAEs), or adversarial components, e.g., generative adversarial networks (GANs). GAN training, though, may be prone to instability [1]. Advanced deep generative modeling can simulate high-quality images, but there remain limitations in controllability to selectively manipulate a specified semantic attribute of objects without unintended side effects of modifying other attributes. Towards this end, disentangled representation learning aims for a compact representation space with uncorrelated/independent components that capture distinct semantic attributes. Unsupervised learning of a disentangled representation, although attempted before [2], [3], is theoretically impossible [2] without inductive biases on the data and the learning process. We propose semi-supervised learning for controllable image generation using a disentangled representation.
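
For concreteness, the sketch below shows a minimal, generic VAE objective in PyTorch; it illustrates the kind of variational learning and disentanglement pressure discussed above and is not the model proposed in this paper (the function name and the beta weight are illustrative assumptions).

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    # Reconstruction term: how well the decoder reproduces the input image x.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Closed-form KL divergence between N(mu, diag(exp(logvar))) and N(0, I).
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    # beta > 1 (as in beta-VAE) is one common inductive bias that encourages
    # more factorized, i.e., disentangled, latent dimensions.
    return recon + beta * kl
```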

We propose a novel DNN-based framework for generative face-image modeling based on variational learning that models image attributes in a latent space, and which explicitly disentangles the major factors of variability, i.e., (i) geometry, by which we mean shape, size, pose, and location of facial features, and (ii) appearance, by which we mean color and texture features. We design for controllability of geometry and appearance attributes using separate generative DNNs, and this allows us to selectively manipulate a specific facial attribute (e.g., smiling lips) without undesirably changing other facial attributes (e.g., nose size or skin tone). We propose a deformable mean-template based model that learns the mean face geometry and the geometric variability in the population around the mean. We propose to learn the template in the space of sketches, by which we mean color images transformed to grayscale (and then contrast-inverted), using dense image registration. Subsequently, to learn the variability in appearance, we propose to first deform the color images to register their geometry to that of the mean sketch, and then learn the appearance variability in the resulting representation space. This design safeguards our two generative models for geometry and appearance from inadvertently modeling major variability in, respectively, appearance and geometry. Thus, we design the learning of geometry and appearance variability in representation spaces with reduced interdependencies, to gain controllability with fewer side effects.
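
A minimal PyTorch-style sketch of the two representation spaces described above follows, under assumed tensor shapes; to_sketch and warp are hypothetical helpers standing in for the sketch transform and the dense-registration warp, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def to_sketch(rgb):
    # rgb: (B, 3, H, W) in [0, 1]. Luminance conversion followed by contrast
    # inversion gives the grayscale "sketch" in which geometry is modeled.
    gray = 0.299 * rgb[:, 0:1] + 0.587 * rgb[:, 1:2] + 0.114 * rgb[:, 2:3]
    return 1.0 - gray

def warp(image, flow):
    # image: (B, C, H, W); flow: (B, H, W, 2) dense displacement field in the
    # normalized [-1, 1] coordinates used by grid_sample (e.g., produced by a
    # registration network or a deformation decoder).
    B, _, H, W = image.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, H, device=image.device),
        torch.linspace(-1.0, 1.0, W, device=image.device),
        indexing="ij")
    identity = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, H, W, 2)
    return F.grid_sample(image, identity + flow, align_corners=True)

# Geometry variability is learned on to_sketch(rgb); appearance variability is
# learned on warp(rgb, flow), where flow registers each face to the mean
# template so that geometric variability is factored out.
```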

This paper proposes a novel DNN-based generative model for face images by explicitly disentangling geometry and appearance modeling to achieve selective controllability of the desired attributes with fewer side effects. To learn geometric variability, we leverage grayscale sketch representations (devoid of color) to learn (i) a deformable mean template representing the population-mean face geometry and (ii) a generative model of deformations to model individual face-geometry variations around the mean, using dense image registration. We learn the appearance variability in a (color-image) space that we explicitly design by factoring out the geometric variability. We propose a novel variational formulation to enable semi-supervised learning when manually-annotated attributes are available for only a tiny fraction of the training set. Results on the CelebA [4] and LFW [5] datasets show that, compared to current methods using deformation models or variational learning, our framework significantly improves face-image model fits and facial-feature controllability even with semi-supervised learning.
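
One hedged way to wire up this semi-supervision is sketched below (the paper's precise variational formulation is developed in the Proposed Method section): a binary-attribute head is penalized only on the small annotated subset, while reconstruction and KL terms use every image; the loss weights and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(x, x_recon, mu, logvar,
                         attr_logits, attr_labels, labeled_mask,
                         lambda_attr=10.0):
    # Unsupervised terms: every image, annotated or not, contributes.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    # Supervised term: per-image binary attribute loss (e.g., Arched Eyebrows,
    # Big Lips), masked so that only the annotated subset contributes.
    attr = F.binary_cross_entropy_with_logits(
        attr_logits, attr_labels, reduction="none").sum(dim=1)
    attr = (attr * labeled_mask).sum()
    return recon + kl + lambda_attr * attr
```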

Background and Related Work

Generative image models typically fail to either offer controllability during generation/manipulation based on semantic attributes or to model a disentangled interpretable feature representation in their latent spaces. Appearance and geometry disentanglement has been receiving attention recently, where many works [6], [7], [8], [9], [10] seek to separately model appearance and geometry. However, our framework differs significantly from all of these works. Like our method, [6] also employs

Proposed Method

We propose a three-module DNN framework (Fig. 1) for controllable generative modeling of a class of images, detailed in the next three subsections. We explicitly disentangle the modeling of variation in geometry (shape, size, location, pose) and appearance (color, texture). We leverage semi-supervised variational learning when attributes are available for only a tiny fraction of the training set.
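
The structural sketch below makes this decomposition concrete; module internals, latent sizes, and class/argument names are placeholders rather than the paper's exact networks. A geometry decoder maps its latent code to a dense deformation field, an appearance decoder maps its latent code to a color image in mean-template geometry, and the same deformation warps both the mean sketch and the appearance image to the individual geometry.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ControllableFaceGenerator(nn.Module):
    def __init__(self, geom_dec, app_dec, mean_template):
        super().__init__()
        self.geom_dec = geom_dec          # z_geom -> (B, H, W, 2) deformation field
        self.app_dec = app_dec            # z_app  -> (B, 3, H, W) appearance image
        # Learned population-mean sketch template, shape (1, 1, H, W).
        self.mean_template = nn.Parameter(mean_template)

    def forward(self, z_geom, z_app):
        flow = self.geom_dec(z_geom)      # individual geometry as a deformation
        appearance = self.app_dec(z_app)  # appearance in mean-template geometry
        B, _, H, W = appearance.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, H, device=flow.device),
            torch.linspace(-1.0, 1.0, W, device=flow.device),
            indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, H, W, 2) + flow
        # Warp both the mean sketch and the appearance image to the individual
        # geometry encoded by z_geom.
        sketch = F.grid_sample(self.mean_template.expand(B, -1, -1, -1),
                               grid, align_corners=True)
        color = F.grid_sample(appearance, grid, align_corners=True)
        return color, sketch
```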

Experiments, Results, and Discussion

Datasets. We use (i) CelebA [4] comprising over 202,000 images, and (ii) LFW [5], [15] comprising over 13,000 images. For each dataset, we select 80% of images for training, 5% for validation, and the rest for testing. For all methods, we tune the free parameters using a brute-force grid search with 5-fold cross-validation. We crop and rescale images to 64×64 pixels. We select attributes exhibiting good variation across the dataset: (i) 8 geometry attributes: Arched Eyebrows, Big Lips, Big
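
The split and 64×64 preprocessing can be reproduced with standard tooling, as in the sketch below; the local folder path, CelebA-style center crop, and random seed are assumptions for illustration, not details taken from the paper.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.CenterCrop(178),        # assumed CelebA-style center crop
    transforms.Resize((64, 64)),       # rescale to 64x64 pixels
    transforms.ToTensor(),
])

# Images are read from an assumed local folder layout.
full = datasets.ImageFolder("data/celeba", transform=preprocess)
n = len(full)
n_train, n_val = int(0.80 * n), int(0.05 * n)
train_set, val_set, test_set = random_split(
    full, [n_train, n_val, n - n_train - n_val],
    generator=torch.Generator().manual_seed(0))
```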

Conclusion

We present a novel method for controllable generation of face images based on an explicitly-designed disentangled modeling of geometry and appearance features using a deformable mean-template based approach. To learn geometric variability, we leverage grayscale sketch representations (devoid of color) to learn (i) a deformable mean template representing the population-mean face geometry and (ii) a generative model of deformations to model individual face-geometry variations around the mean,

Declaration of Competing Interest

None of the authors have any conflicts of interest to report.

References (28)

  • S. Liu et al., Discovering influential factors in variational autoencoders, Pattern Recognit. (2020)

  • L. Gao et al., Lightweight dynamic conditional GAN with pyramid attention for text-to-image synthesis, Pattern Recognit. (2021)

  • W. Joo et al., Dirichlet variational autoencoder, Pattern Recognit. (2020)

  • M. Arjovsky et al., Towards principled methods for training generative adversarial networks, Int. Conf. Learn. Rep. (2017)

  • F. Locatello et al., Challenging common assumptions in the unsupervised learning of disentangled representations, Int. Conf. Learn. Rep. (2019)

  • Z. Liu et al., Deep learning face attributes in the wild, IEEE Int. Conf. Comput. Vis. (2015)

  • C. Sanderson et al., Multi-region probabilistic histograms for robust and scalable identity inference, Adv. Biometr. (2009)

  • Z. Shu et al., Deforming autoencoders: Unsupervised disentangling of shape and appearance, Proc. Eur. Conf. Comput. Vis. (2018)

  • X. Xing et al., Unsupervised disentangling of appearance and geometry by deformable generator network, Comput. Vis. Pattern Recognit. (2019)

  • L. Tran et al., Disentangling geometry and appearance with regularised geometry-aware generative adversarial networks, Int. J. Comput. Vis. (2019)

  • F. Xiao et al., Identity from here, pose from there: Self-supervised disentanglement and generation of objects using unlabeled videos, IEEE Int. Conf. Comput. Vis. (2019)

  • A. Dabouei et al., Boosting deep face recognition via disentangling appearance and geometry, Winter Conf. Appl. Comput. Vis. (2020)

  • Y. Li et al., MixNMatch: Multifactor disentanglement and encoding for conditional image generation, Comput. Vis. Pattern Recognit. (2020)

  • X. Yan et al., Attribute2Image: Conditional image generation from visual attributes, Proc. Eur. Conf. Comput. Vis. (2016)

Krishna Wadhwani is a Bachelor of Technology graduate engineer from the Department of Aerospace Engineering at the Indian Institute of Technology Bombay, Mumbai, India. His research interests include deep learning and computer vision. More details are available at https://krishnaw14.github.io/.

Suyash P. Awate is an Associate Professor in the Department of Computer Science and Engineering at the Indian Institute of Technology Bombay, Mumbai, India. His research interests include image analysis, machine learning, medical image computing, and computer vision. More details are available at https://www.cse.iitb.ac.in/~suyash/.

Thanks to the Infrastructure Facility for Advanced Research and Education in Diagnostics grant funded by the Department of Biotechnology, Government of India (BT/INF/22/SP23026/2017).
