Abstract
Image synthesis is a core problem in modern deep learning, and many recent architectures such as autoencoders and generative adversarial networks produce spectacular results on highly complex data, such as images of faces or landscapes. While these results open up a wide range of new, advanced synthesis applications, there is also a severe lack of theoretical understanding of how these networks work. This results in a wide range of practical problems, such as difficulties in training, the tendency to sample images with little or no variability, and generalization problems. In this paper, we propose to analyze the ability of the simplest generative network, the autoencoder, to encode and decode two simple geometric attributes: size and position. We believe that, in order to understand more complicated tasks, it is necessary to first understand how these networks process simple attributes. For the first attribute, we analyze the case of images of centered disks with variable radii. We explain how the autoencoder projects these images to and from a latent space of the smallest possible dimension, a scalar. In particular, we describe the encoding process and give a closed-form solution to the decoding training problem for a network without biases, and we show that during training the network indeed finds this solution. We then investigate regularization approaches that yield networks which generalize well. For the second attribute, position, we look at the encoding and decoding of Dirac delta functions, also known as “one-hot” vectors. We describe a handcrafted filter that achieves perfect encoding and show that the network naturally finds this filter during training. We also show experimentally that decoding can be achieved if the dataset is sampled in an appropriate manner.
We hope that the insights given here will provide better understanding of the precise mechanisms used by generative networks and will ultimately contribute to producing more robust and generalizable networks.
Notes
If not, then consider its mean on every circle, which decreases the \(L^2\) norm of f while maintaining the scalar product with any disk. We can then restore the energy by dividing by this smaller \(L^2\) norm, according to \(\Vert f\Vert _2=1\).
This obviously cannot happen in the case of a full autoencoder, but we must impose it when studying the encoder only.
References
Alain, G., Bengio, Y.: What regularized auto-encoders learn from the data-generating distribution. J. Mach. Learn. Res. 15(1), 3563–3593 (2014)
Ballé, J., Laparra, V., Simoncelli, E.P.: End-to-End Optimized Image Compression. arXiv:1611.01704 (2016)
Bengio, Y., Monperrus, M.: Non-local manifold tangent learning. In: Advances in Neural Information Processing Systems, pp. 129–136 (2005)
Bourlard, H., Kamp, Y.: Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybern. 59(4), 291–294 (1988)
Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797 (2017)
Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (2011)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
Ha, D., Eck, D.: A Neural Representation of Sketch Drawings. arXiv:1704.03477 (2017)
Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: IEEE Conference on Computer Vision and Pattern Recognition (2006). https://doi.org/10.1109/CVPR.2006.100
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134 (2017)
Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive Growing of GANs for Improved Quality, Stability, and Variation. arXiv:1710.10196 (2017)
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: International Conference on Learning Representations (2014)
Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L., Ranzato, M.A.: Fader networks: Manipulating images by sliding attributes. In: Advances in Neural Information Processing Systems, pp. 5967–5976 (2017)
LeCun, Y.: Learning processes in an asymmetric threshold network. Ph.D. thesis, Paris VI (1987)
Liao, Y., Wang, Y., Liu, Y.: Graph regularized auto-encoders for image representation. IEEE Trans. Image Process. 26(6), 2839–2852 (2017)
Liu, R., Lehman, J., Molino, P., Such, F.P., Frank, E., Sergeev, A., Yosinski, J.: An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution. arXiv:1807.03247 (2018)
Makhzani, A., Frey, B.: K-sparse Autoencoders. arXiv:1312.5663 (2013)
Metz, L., Poole, B., Pfau, D., Sohl-Dickstein, J.: Unrolled Generative Adversarial Networks. arXiv:1611.02163 (2016)
Radford, A., Metz, L., Chintala, S.: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434 (2015)
Ranzato, M., Boureau, Y., LeCun, Y.: Sparse feature learning for deep belief networks. In: Conference on Neural Information Processing Systems (2007)
Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contractive auto-encoders: explicit invariance during feature extraction. In: Proceedings of the 28th International Conference on Machine Learning (2011)
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, pp. 2234–2242 (2016)
Upchurch, P., Gardner, J.R., Pleiss, G., Pless, R., Snavely, N., Bala, K., Weinberger, K.Q.: Deep feature interpolation for image content changes. In: CVPR, vol. 1, p. 3 (2017)
Zhu, J.Y., Krähenbühl, P., Shechtman, E., Efros, A.A.: Generative visual manipulation on the natural image manifold. In: European Conference on Computer Vision (2016)
Acknowledgements
This work was funded by the Agence Nationale de la Recherche (ANR) under Grant No. ANR-14-CE27-0019, as part of the MIRIAM project.
Appendices
Appendix A: Contractive Encoders Learn the Area of Disks
We study the encoder part of an autoencoder that takes an image and outputs a one-dimensional feature. We show that, under the simple constraint that the output of the encoder is not constant, and with the contractive loss as the only training loss, the learned feature is precisely the area of the disk presented to the encoder.
We refer to the input image as \(x_r\), where r is the radius of the disk present in the image (one disk for each image, and each disk centered). In this simple setting, we will seek to find a function \(z:L^2(\varOmega )\rightarrow \mathbb {R}\) that stands for the encoder E, where \(\varOmega \) is the support of the images. The loss associated with a contractive autoencoder [22] is
where \(\nabla z \in L^2(\varOmega )\) stands for the gradient of the latent z with respect to the input image, evaluated at the input image \(x_r\) (this gradient is itself an image). \(R_{\max }\) is the maximum radius observed in the dataset, which we normalize to 1. Although the argument of a loss is typically the set of parameters \(\theta \) of a network, usually written \(\mathcal {L}(\theta )\), here we minimize over all possible encoders z, simulating an infinite capacity of the encoder, hence the notation \(\mathcal {L}(z)\).
We can take a continuous proxy for this loss and write
Note that the integration against the simple measure \(\mathrm{d}r\) reflects the fact that the distribution of the radii is uniform. In anticipation of the derivations ahead, we suppose that the encoder function is smooth and that the edges of the shapes are also smooth; we will investigate afterward what happens when the shapes become infinitely sharp. We can express this by
where \(p=(p_x,p_y)\) is a position, \(\varphi \) is some smooth real function that is equal to 1 before \(-1\) and to 0 after 1 (think of a simplified \(\text{ tanh }\) function), and \(\sigma \) is a scaling factor. When \(\sigma \) goes to zero, we are in the case of sharp edges. Other smooth representations of a disk are possible, for example, \(x_r(p_x,p_y)=(\mathbb {1}_{B_r}*g_{\sigma })(p_x,p_y)\), where \(\mathbb {1}_{B_r}\) is the indicator function of the ball of radius r, as used in our experiments; when \(\sigma \) goes to zero, we are back to sharp edges again. We will stick to the representation in Eq. (22) since it simplifies our calculations further on, in particular in Sect. A.1.
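As an illustration, this smooth disk model can be sketched in a few lines. The particular ramp used for \(\varphi \) below is our own illustrative stand-in, not the authors' choice; any smooth profile equal to 1 before \(-1\) and to 0 after 1 would do.

```python
import numpy as np

def smooth_disk(r, sigma=1.0, size=64):
    """Smooth disk of Eq. (22): x_r(p) = phi((|p| - r) / sigma).

    phi is any smooth profile equal to 1 before -1 and to 0 after 1;
    the clamped sine ramp below is an illustrative stand-in."""
    def phi(t):
        t = np.clip(t, -1.0, 1.0)
        return 0.5 * (1.0 - np.sin(np.pi * t / 2.0))  # 1 at t <= -1, 0 at t >= 1
    c = (size - 1) / 2.0                               # image center
    yy, xx = np.mgrid[0:size, 0:size]
    rho = np.sqrt((xx - c) ** 2 + (yy - c) ** 2)       # |p|, distance to center
    return phi((rho - r) / sigma)
```

Pixels well inside the disk take the value 1, pixels well outside take 0, and the transition happens smoothly over a band of width \(2\sigma \) around the circle of radius r.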
To avoid trivial cases, we also require our encoder not to be constant.Footnote 2 Once scaled, this constraint can be written as
the last equality being the chain rule. Let us denote
Now our problem boils down to
The minimization is performed among all functions z that are smooth enough to have a gradient with respect to their input x.
For a fixed r, among all \(\nabla z(x_r)\) satisfying
for some constant C(r), the one with minimal \(\Vert \nabla z(x_r)\Vert \) is of the form \(c(r)h_r\). To see this, write \(\nabla z(x_r)=\beta h_r+h_r^\perp \), a decomposition of \(\nabla z(x_r)\) on \(\mathrm {Vect}(h_r)\) and its orthogonal space (in \(L^2(\varOmega )\)). Hence, we can decrease the quantity to minimize in Eq. (25) without changing the constraint by projecting \(\nabla z(x_r)\) on \(\mathrm {Vect}(h_r)\). Thus, we can make the assumption that our solution z is such that
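Spelling out the projection argument (a sketch, assuming the constraint of Eq. (24) is the scalar product \(\left\langle \nabla z(x_r), h_r\right\rangle =C(r)\), as the chain rule above suggests): by orthogonality of the decomposition,

\[ \Vert \nabla z(x_r)\Vert ^2=\beta ^2\Vert h_r\Vert ^2+\Vert h_r^\perp \Vert ^2\ \ge \ \beta ^2\Vert h_r\Vert ^2, \qquad \left\langle \nabla z(x_r), h_r\right\rangle =\beta \,\Vert h_r\Vert ^2, \]

so dropping the orthogonal component strictly decreases the quantity minimized in Eq. (25) whenever \(h_r^\perp \ne 0\), while leaving the constraint untouched.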
and we are reduced to finding a single function c that satisfies
where
Let us consider a small perturbation of the solution, \(c+\epsilon \delta \), for some smooth function \(\delta \) which satisfies \(\int _0^1 \delta (r)H_2(r)\mathrm{d}r=0\) (the derivative of the constraint). Then, we have
If we take the limit when \(\epsilon \rightarrow 0\), we have the condition
The solution of the system (27) is \(c(r)=C\) for some constant C, since the only function c(r) that satisfies Eq. (30) for any valid increment \(\delta \) is a constant one. Indeed, we have the two conditions \(\delta \in \mathrm {Vect}(H_2)^\perp \) and \(\left\langle c H_2, \delta \right\rangle =0\). This means that \(c H_2 \in \left( \mathrm {Vect}(H_2)^\perp \right) ^\perp = \mathrm {Vect}(H_2)\).
Finally, when the edge width \(\sigma \) goes to zero, the function \(h_r\) tends to be concentrated on a circle of radius r (see Sect. A.1), with a value that is almost constant over the range of r. Roughly speaking, this gives
For the sake of completeness, we have verified experimentally that the function \(h_r\) is indeed concentrated on a circle of radius r. These results can be seen in Fig. 11.
Finally, by integrating, we have
where \(\gamma \) is some constant.
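The chain of reasoning can be summarized as follows (a sketch, under the identification of \(h_r\) with the derivative of \(x_r\) with respect to r suggested by the chain rule above):

\[ \frac{\mathrm{d}}{\mathrm{d}r}\,z(x_r)=\left\langle \nabla z(x_r),\,h_r\right\rangle =C\,\Vert h_r\Vert ^2=C\,H_2(r)\ \propto \ r \quad \Longrightarrow \quad z(x_r)\approx \gamma \,r^2, \]

i.e., up to a multiplicative constant, the code z is the area \(\pi r^2\) of the disk.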
1.1 A.1 Infinitely Thin Edges
Here, we show our claim that, when the edge width goes to 0, z(r) is indeed proportional to the disk area (Eq. (31)). We do this with the model described in Eq. (22).
where \(\varphi \) is some smooth function that is equal to 1 before \(-1\) and to 0 after 1.
In this case, we have
The support of \(\varphi '_\sigma \) is \([-\sigma ,\sigma ]\). This function is radial, and we are interested in computing (28), which gives
with the variable u being \(\sqrt{p_x^2+p_y^2}\).
For \(r\ge \sigma \), we have the following simple inequalities
where \(C=\int \varphi '^2(t)\,\mathrm{d}t\). This confirms that \(H_2(r)\) behaves as being essentially proportional to r.
More precisely, we obtain (by integration as in (32))
which is the announced behavior for z. \(\square \)
1.2 A.2 Experimental Results
To further test this behavior, we applied our contractive autoencoder, trained on disks, to a test set of images containing squares and non-centered disks. As can be seen in Fig. 12, the encoder extracts the area of these objects and outputs the disk with the closest area (since it has been trained on a disk database), further confirming that the encoder is indeed extracting the area.
Appendix B: Creating the Disk Dataset
We wish to create a dataset containing images of centered disks. Since the autoencoder must project each image to a continuous scalar, it makes sense to generate the disks with a continuous parameter r, and to require that the disks also be “continuous” in some sense (each different value of r should produce a different disk). For this, as mentioned in Sect. 4.1, we create the training images \(x_r\) as
where \(\mathbb {1}_{\mathbb {B}_r}\) is the indicator function of the ball of radius r and \(g_{\sigma }\) is a Gaussian kernel with variance \(\sigma \). In practical terms, we carry this out using a Monte Carlo simulation to approximate the convolution of the indicator function with the Gaussian. Indeed, let \(\xi _i\), \(i=1,\ldots ,N\), be a sequence of independent and identically distributed (iid) random variables, with \(\xi _i \sim \mathcal {N}\left( 0, \sigma \right) \). Each pixel at position t is evaluated as
According to the law of large numbers, this estimate converges to the exact value of \(g_{\sigma } *\mathbb {1}_{\mathbb {B}_r}\) and gives a method for producing a continuous dataset.
While other approaches are available (evaluating the convolution in the Fourier domain, for example), this one is simple to implement and generalizes to any shape that we can parametrize. We also note that the large majority of deep learning synthesis papers suppose that the data lie on some manifold, but this hypothesis is never checked. In our case, we explicitly sample the data in a smooth space.
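A minimal sketch of this Monte Carlo construction (the function and parameter names are ours, not the authors' code): each pixel value is estimated as the fraction of Gaussian-perturbed copies of the pixel position that fall inside the disk, which converges to \((g_{\sigma } *\mathbb {1}_{\mathbb {B}_r})(t)\) by the law of large numbers.

```python
import numpy as np

def disk_image_mc(r, sigma=0.7, size=64, n_samples=200, seed=0):
    """Monte Carlo estimate of g_sigma * 1_{B_r}: each pixel value is
    the fraction of the N Gaussian perturbations of the pixel position
    that land inside the centered disk of radius r."""
    rng = np.random.default_rng(seed)
    c = (size - 1) / 2.0
    yy, xx = np.mgrid[0:size, 0:size].astype(float)
    xx -= c
    yy -= c
    acc = np.zeros((size, size))
    for _ in range(n_samples):
        dx, dy = rng.normal(0.0, sigma, size=2)
        # indicator 1_{B_r}(t + xi_i), accumulated over the N samples
        acc += ((xx + dx) ** 2 + (yy + dy) ** 2 <= r * r)
    return acc / n_samples
```

Because each sampled r gives a slightly different smooth edge profile, the resulting family of images varies continuously with r, which is the property the dataset needs.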
Appendix C: Decoding of a Disk (Network with No Biases)
During the training of the autoencoder for the case of disks (with no biases in the autoencoder), the objective of the decoder is to convert a scalar into the image of a disk, with the \(\ell _2\) distance as a metric. Given the profiles of the output of the autoencoder, we have made the hypothesis that the decoder approximates a disk of radius r with a function \(y(t;r):= D(E(\mathbb {1}_{B_r})) = h(r) f(t)\), where f is a continuous function. We verify this experimentally in Fig. 13: we determine f by averaging all output profiles and then compare our code z against its theoretically optimal value \(\left\langle f,\mathbb {1}_{B_r}\right\rangle \). We see that they agree up to a multiplicative constant C.
We now compare the numerical optimization of the energy in Eq. (7), using a gradient descent approach, with the profile obtained by the autoencoder without biases. The resulting comparison can be seen in Fig. 14. One can also derive a closed-form solution of Eq. (7) by means of the Euler-Lagrange equation: the optimal f for Eq. (7) is the solution of the differential equation \(y^{\prime \prime }=-kty\) with initial state \((y,y^\prime )=(1,0)\), where k is a free positive constant that accounts for the position of the first zero of y. This gives a closed form for f in terms of Airy functions.
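This connection can be checked numerically; the following is a sketch (the naming is ours), verifying that the integrated solution of \(y^{\prime \prime }=-kty\), \((y,y^\prime )(0)=(1,0)\), coincides with the combination of Airy functions fixed by those initial conditions.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.special import airy

def decoder_profile(k=1.0, t_max=5.0, n=501):
    """Numerically integrate y'' = -k t y with (y, y')(0) = (1, 0)."""
    sol = solve_ivp(lambda t, y: [y[1], -k * t * y[0]],
                    (0.0, t_max), [1.0, 0.0],
                    t_eval=np.linspace(0.0, t_max, n),
                    rtol=1e-9, atol=1e-12)
    return sol.t, sol.y[0]

def airy_profile(t, k=1.0):
    """Closed form: with s = k**(1/3) * t, y = a*Ai(-s) + b*Bi(-s),
    the coefficients being fixed by the initial conditions via the
    Airy Wronskian Ai*Bi' - Ai'*Bi = 1/pi."""
    _, Aip0, _, Bip0 = airy(0.0)
    a = np.pi * Bip0     # enforces a*Ai'(0) + b*Bi'(0) = 0
    b = -np.pi * Aip0    # and a*Ai(0) + b*Bi(0) = 1
    s = k ** (1.0 / 3.0) * np.asarray(t)
    Ai, _, Bi, _ = airy(-s)
    return a * Ai + b * Bi
```

The substitution \(s=k^{1/3}t\) turns \(y^{\prime \prime }=-kty\) into the standard Airy equation \(w^{\prime \prime }(s)=-s\,w(s)\), whose solutions are \(\mathrm {Ai}(-s)\) and \(\mathrm {Bi}(-s)\); varying k simply rescales the position of the first zero of the profile.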
Appendix D: Autoencoding Disks with a Database with a Limited Observed Radius (Network with no Biases)
In Fig. 15, we see the gray levels of the input/output of an autoencoder trained (without biases) on a restricted database, that is to say, a database whose disks have a maximum radius R smaller than the image width. We have used \(R=18\) for these experiments. We see that the decoder learns a useful function f which only extends to this maximum radius. Beyond this radius, another, untuned function is used, corresponding to codes of the opposite sign (see Proposition 2).
Appendix E: Autoencoding Disks with a DCGAN [20]
In Fig. 16, we show the autoencoding results of the DCGAN network of Radford et al. [20]. We trained their network with a code size of \(d=1\). As can be seen, the DCGAN learns to force the training data to a predefined latent distribution, which cannot be modified during training (contrary to the autoencoder). Thus, the network fails to correctly autoencode disks in the missing radius region, which has not been observed in the training database.
Newson, A., Almansa, A., Gousseau, Y. et al. Processing Simple Geometric Attributes with Autoencoders. J Math Imaging Vis 62, 293–312 (2020). https://doi.org/10.1007/s10851-019-00924-w