Pattern Recognition

Volume 118, October 2021, 108001

Controllable Image Generation with Semi-supervised Deep Learning and Deformable-Mean-Template Based Geometry-Appearance Disentanglement

https://doi.org/10.1016/j.patcog.2021.108001

Highlights

  • Generative controllable neural-net model by explicitly disentangling geometry and appearance.

  • Learn geometry variability using population-mean template and per-individual deformations.

  • Learn appearance variability in image space designed to factor out geometric variability.

  • Semi-supervised variational learning with limited manually-annotated attributes.

  • Empirical analysis on two large public datasets, comparing several existing methods.

Abstract

Typical deep-neural-network (DNN) based generative image models often (i) show limited ability to learn a disentangled latent representation, (ii) show limited controllability, leading to undesirable side effects when manipulating selected attributes during image generation, and (iii) require large attribute-annotated training sets. We propose a generative DNN model for face images by explicitly disentangling geometry and appearance modeling to achieve selective controllability of the desired attributes with fewer side effects. To learn geometric variability, we leverage grayscale sketch representations to learn (i) a deformable mean template representing the population-mean face geometry and (ii) a generative model of deformations to model individual face-geometry variations, using dense image registration. We learn the appearance variability in a (color-image) space that we explicitly design by factoring out the geometric variability. We propose a variational formulation to enable semi-supervised learning when manually-annotated attributes are severely limited in the training set. Results on large datasets show that, compared to schemes using deformation models or variational learning, our method significantly improves face-image model fits and facial-feature controllability even with semi-supervised learning.

Introduction

Generative modeling and representation learning are important in many fields including image analysis where deep neural networks (DNNs) can learn highly nonlinear mappings to transform low-dimensional parametric distributions to complex high-dimensional distributions in the space of images. Prevalent generative models either use variational learning, e.g., variational autoencoders (VAEs), or adversarial components, e.g., generative adversarial networks (GANs). GAN training, though, may be prone to instability [1]. Advanced deep generative modeling can simulate high-quality images, but there remain limitations in controllability to selectively manipulate a specified semantic attribute of objects without unintended side effects of modifying other attributes. Towards this end, disentangled representation learning aims for a compact representation space with uncorrelated/independent components that capture distinct semantic attributes. Unsupervised learning of a disentangled representation, although attempted before [2], [3], is theoretically impossible [2] without inductive biases on the data and the learning process. We propose semi-supervised learning for controllable image generation using a disentangled representation.
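
For concreteness, the sketch below shows a minimal, generic VAE objective in PyTorch; it illustrates the kind of variational learning and disentanglement pressure discussed above and is not the model proposed in this paper (the function name and the beta weight are illustrative assumptions).

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    # Reconstruction term: how well the decoder reproduces the input image x.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Closed-form KL divergence between N(mu, diag(exp(logvar))) and N(0, I).
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    # beta > 1 (as in beta-VAE) is one common inductive bias that encourages
    # more factorized, i.e., disentangled, latent dimensions.
    return recon + beta * kl
```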

We propose a novel DNN-based framework for generative face-image modeling based on variational learning that models image attributes in a latent space, and which explicitly disentangles the major factors of variability, i.e., (i) geometry, by which we mean shape, size, pose, and location of facial features, and (ii) appearance, by which we mean color and texture features. We design for controllability of geometry and appearance attributes using separate generative DNNs, and this allows us to selectively manipulate a specific facial attribute (e.g., smiling lips) without undesirably changing other facial attributes (e.g., nose size or skin tone). We propose a deformable mean-template based model that learns the mean face geometry and the geometric variability in the population around the mean. We propose to learn the template in the space of sketches, by which we mean color images transformed to grayscale (and then contrast-inverted), using dense image registration. Subsequently, to learn the variability in appearance, we propose to first deform the color images to register their geometry to that of the mean sketch, and then learn the appearance variability in the resulting representation space. This design safeguards our two generative models for geometry and appearance from inadvertently modeling major variability in, respectively, appearance and geometry. Thus, we design the learning of geometry and appearance variability in representation spaces with reduced interdependencies, to gain controllability with fewer side effects.
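
A minimal PyTorch-style sketch of the two representation spaces described above follows, under assumed tensor shapes; to_sketch and warp are hypothetical helpers standing in for the sketch transform and the dense-registration warp, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def to_sketch(rgb):
    # rgb: (B, 3, H, W) in [0, 1]. Luminance conversion followed by contrast
    # inversion gives the grayscale "sketch" in which geometry is modeled.
    gray = 0.299 * rgb[:, 0:1] + 0.587 * rgb[:, 1:2] + 0.114 * rgb[:, 2:3]
    return 1.0 - gray

def warp(image, flow):
    # image: (B, C, H, W); flow: (B, H, W, 2) dense displacement field in the
    # normalized [-1, 1] coordinates used by grid_sample (e.g., produced by a
    # registration network or a deformation decoder).
    B, _, H, W = image.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, H, device=image.device),
        torch.linspace(-1.0, 1.0, W, device=image.device),
        indexing="ij")
    identity = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, H, W, 2)
    return F.grid_sample(image, identity + flow, align_corners=True)

# Geometry variability is learned on to_sketch(rgb); appearance variability is
# learned on warp(rgb, flow), where flow registers each face to the mean
# template so that geometric variability is factored out.
```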

This paper proposes a novel DNN-based generative model for face images by explicitly disentangling geometry and appearance modeling to achieve selective controllability of the desired attributes with fewer side effects. To learn geometric variability, we leverage grayscale sketch representations (devoid of color) to learn (i) a deformable mean template representing the population-mean face geometry and (ii) a generative model of deformations to model individual face-geometry variations around the mean, using dense image registration. We learn the appearance variability in a (color-image) space that we explicitly design by factoring out the geometric variability. We propose a novel variational formulation to enable semi-supervised learning when manually-annotated attributes are available for only a tiny fraction of the training set. Results on the CelebA [4] and LFW [5] datasets show that, compared to current methods using deformation models or variational learning, our framework significantly improves face-image model fits and facial-feature controllability even with semi-supervised learning.
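
One hedged way to wire up this semi-supervision is sketched below (the paper's precise variational formulation is developed in the Proposed Method section): a binary-attribute head is penalized only on the small annotated subset, while reconstruction and KL terms use every image; the loss weights and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(x, x_recon, mu, logvar,
                         attr_logits, attr_labels, labeled_mask,
                         lambda_attr=10.0):
    # Unsupervised terms: every image, annotated or not, contributes.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    # Supervised term: per-image binary attribute loss (e.g., Arched Eyebrows,
    # Big Lips), masked so that only the annotated subset contributes.
    attr = F.binary_cross_entropy_with_logits(
        attr_logits, attr_labels, reduction="none").sum(dim=1)
    attr = (attr * labeled_mask).sum()
    return recon + kl + lambda_attr * attr
```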

Background and Related Work

Generative image models typically fail to either offer controllability during generation/manipulation based on semantic attributes or to model a disentangled interpretable feature representation in their latent spaces. Appearance and geometry disentanglement has been receiving attention recently, where many works [6], [7], [8], [9], [10] seek to separately model appearance and geometry. However, our framework differs significantly from all of these works. Like our method, [6] also employs

Proposed Method

We propose a three-module DNN framework (Fig. 1) for controllable generative modeling of a class of images, detailed in the next three subsections. We explicitly disentangle the modeling of variation in geometry (shape, size, location, pose) and appearance (color, texture). We leverage semi-supervised variational learning when attributes are available for only a tiny fraction of the training set.
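
The structural sketch below makes this decomposition concrete; module internals, latent sizes, and class/argument names are placeholders rather than the paper's exact networks. A geometry decoder maps its latent code to a dense deformation field, an appearance decoder maps its latent code to a color image in mean-template geometry, and the same deformation warps both the mean sketch and the appearance image to the individual geometry.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ControllableFaceGenerator(nn.Module):
    def __init__(self, geom_dec, app_dec, mean_template):
        super().__init__()
        self.geom_dec = geom_dec          # z_geom -> (B, H, W, 2) deformation field
        self.app_dec = app_dec            # z_app  -> (B, 3, H, W) appearance image
        # Learned population-mean sketch template, shape (1, 1, H, W).
        self.mean_template = nn.Parameter(mean_template)

    def forward(self, z_geom, z_app):
        flow = self.geom_dec(z_geom)      # individual geometry as a deformation
        appearance = self.app_dec(z_app)  # appearance in mean-template geometry
        B, _, H, W = appearance.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, H, device=flow.device),
            torch.linspace(-1.0, 1.0, W, device=flow.device),
            indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, H, W, 2) + flow
        # Warp both the mean sketch and the appearance image to the individual
        # geometry encoded by z_geom.
        sketch = F.grid_sample(self.mean_template.expand(B, -1, -1, -1),
                               grid, align_corners=True)
        color = F.grid_sample(appearance, grid, align_corners=True)
        return color, sketch
```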

Experiments, Results, and Discussion

Datasets. We use (i) CelebA [4] comprising over 202,000 images, and (ii) LFW [5], [15] comprising over 13,000 images. For each dataset, we select 80% of images for training, 5% for validation, and the rest for testing. For all methods, we tune the free parameters using a brute-force grid search with 5-fold cross-validation. We crop and rescale images to 64×64 pixels. We select attributes exhibiting good variation across the dataset: (i) 8 geometry attributes: Arched Eyebrows, Big Lips, Big
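
The split and 64×64 preprocessing can be reproduced with standard tooling, as in the sketch below; the local folder path, CelebA-style center crop, and random seed are assumptions for illustration, not details taken from the paper.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.CenterCrop(178),        # assumed CelebA-style center crop
    transforms.Resize((64, 64)),       # rescale to 64x64 pixels
    transforms.ToTensor(),
])

# Images are read from an assumed local folder layout.
full = datasets.ImageFolder("data/celeba", transform=preprocess)
n = len(full)
n_train, n_val = int(0.80 * n), int(0.05 * n)
train_set, val_set, test_set = random_split(
    full, [n_train, n_val, n - n_train - n_val],
    generator=torch.Generator().manual_seed(0))
```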

Conclusion

We present a novel method for controllable generation of face images based on an explicitly-designed disentangled modeling of geometry and appearance features using a deformable mean-template based approach. To learn geometric variability, we leverage grayscale sketch representations (devoid of color) to learn (i) a deformable mean template representing the population-mean face geometry and (ii) a generative model of deformations to model individual face-geometry variations around the mean,

Declaration of Competing Interest

None of the authors have any conflicts of interest to report.

References (28)

  • S. Liu et al., Discovering influential factors in variational autoencoders, Pattern Recognit. (2020)

  • L. Gao et al., Lightweight dynamic conditional GAN with pyramid attention for text-to-image synthesis, Pattern Recognit. (2021)

  • W. Joo et al., Dirichlet variational autoencoder, Pattern Recognit. (2020)

  • M. Arjovsky et al., Towards principled methods for training generative adversarial networks, Int. Conf. Learn. Rep. (2017)

  • F. Locatello et al., Challenging common assumptions in the unsupervised learning of disentangled representations, Int. Conf. Learn. Rep. (2019)

  • Z. Liu et al., Deep learning face attributes in the wild, IEEE Int. Conf. Comput. Vis. (2015)

  • C. Sanderson et al., Multi-region probabilistic histograms for robust and scalable identity inference, Adv. Biometr. (2009)

  • Z. Shu et al., Deforming autoencoders: Unsupervised disentangling of shape and appearance, Proc. Eur. Conf. Comput. Vis. (2018)

  • X. Xing et al., Unsupervised disentangling of appearance and geometry by deformable generator network, Comput. Vis. Pattern Recognit. (2019)

  • L. Tran et al., Disentangling geometry and appearance with regularised geometry-aware generative adversarial networks, Int. J. Comput. Vis. (2019)

  • F. Xiao et al., Identity from here, pose from there: Self-supervised disentanglement and generation of objects using unlabeled videos, IEEE Int. Conf. Comput. Vis. (2019)

  • A. Dabouei et al., Boosting deep face recognition via disentangling appearance and geometry, Winter Conf. Appl. Comput. Vis. (2020)

  • Y. Li et al., MixNMatch: Multifactor disentanglement and encoding for conditional image generation, Comput. Vis. Pattern Recognit. (2020)

  • X. Yan et al., Attribute2Image: Conditional image generation from visual attributes, Proc. Eur. Conf. Comput. Vis. (2016)

Krishna Wadhwani is a Bachelor of Technology graduate engineer from the Department of Aerospace Engineering at the Indian Institute of Technology Bombay, Mumbai, India. His research interests include deep learning and computer vision. More details are available at https://krishnaw14.github.io/.

Suyash P. Awate is an Associate Professor in the Department of Computer Science and Engineering at the Indian Institute of Technology Bombay, Mumbai, India. His research interests include image analysis, machine learning, medical image computing, and computer vision. More details are available at https://www.cse.iitb.ac.in/~suyash/.

Thanks to the Infrastructure Facility for Advanced Research and Education in Diagnostics grant funded by the Department of Biotechnology, Government of India (BT/INF/22/SP23026/2017).
