
Processing Simple Geometric Attributes with Autoencoders

Journal of Mathematical Imaging and Vision

Abstract

Image synthesis is a core problem in modern deep learning, and many recent architectures such as autoencoders and generative adversarial networks produce spectacular results on highly complex data, such as images of faces or landscapes. While these results open up a wide range of new, advanced synthesis applications, there is also a severe lack of theoretical understanding of how these networks work. This results in a wide range of practical problems, such as difficulties in training, the tendency to sample images with little or no variability and generalization problems. In this paper, we propose to analyze the ability of the simplest generative network, the autoencoder, to encode and decode two simple geometric attributes: size and position. We believe that, in order to understand more complicated tasks, it is necessary to first understand how these networks process simple attributes. For the first property, we analyze the case of images of centered disks with variable radii. We explain how the autoencoder projects these images to and from a latent space of smallest possible dimension, a scalar. In particular, we describe both the encoding process and a closed-form solution to the decoding training problem in a network without biases, and show that during training the network indeed finds this solution. We then investigate the best regularization approaches which yield networks that generalize well. For the second property, position, we look at the encoding and decoding of Dirac delta functions, also known as “one-hot” vectors. We describe a handcrafted filter that achieves encoding perfectly and show that the network naturally finds this filter during training. We also show experimentally that the decoding can be achieved if the dataset is sampled in an appropriate manner.
We hope that the insights given here will provide better understanding of the precise mechanisms used by generative networks and will ultimately contribute to producing more robust and generalizable networks.


Notes

  1. If not, then consider instead its mean on every circle, which decreases the \(L^2\) norm of f while maintaining the scalar product with any disk. We can then increase the energy back by dividing by this smaller \(L^2\) norm, restoring the normalization \(\Vert f\Vert _2=1\).

  2. This obviously cannot happen in the case of a full autoencoder, but we must impose it when studying the encoder only.

References

  1. Alain, G., Bengio, Y.: What regularized auto-encoders learn from the data-generating distribution. J. Mach. Learn. Res. 15(1), 3563–3593 (2014)


  2. Ballé, J., Laparra, V., Simoncelli, E.P.: End-to-End Optimized Image Compression. arXiv:1611.01704 (2016)

  3. Bengio, Y., Monperrus, M.: Non-local manifold tangent learning. In: Advances in Neural Information Processing Systems, pp. 129–136 (2005)

  4. Bourlard, H., Kamp, Y.: Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybern. 59(4), 291–294 (1988)


  5. Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797 (2017)

  6. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (2011)

  7. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)


  8. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)

  9. Ha, D., Eck, D.: A Neural Representation of Sketch Drawings. arXiv:1704.03477 (2017)

  10. Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: IEEE Conference on Computer Vision and Pattern Recognition (2006). https://doi.org/10.1109/CVPR.2006.100

  11. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134 (2017)

  12. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive Growing of GANs for Improved Quality, Stability, and Variation. arXiv:1710.10196 (2017)

  13. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: International Conference on Learning Representations (2014)

  14. Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L., Ranzato, M.A.: Fader networks: Manipulating images by sliding attributes. In: Advances in Neural Information Processing Systems, pp. 5967–5976 (2017)

  15. LeCun, Y.: Learning processes in an asymmetric threshold network. Ph.D. thesis, Paris VI (1987)

  16. Liao, Y., Wang, Y., Liu, Y.: Graph regularized auto-encoders for image representation. IEEE Trans. Image Process. 26(6), 2839–2852 (2017)


  17. Liu, R., Lehman, J., Molino, P., Such, F.P., Frank, E., Sergeev, A., Yosinski, J.: An Intriguing Failing of Convolutional Neural Networks and the Coordconv Solution. arXiv:1807.03247 (2018)

  18. Makhzani, A., Frey, B.: K-sparse Autoencoders. arXiv:1312.5663 (2013)

  19. Metz, L., Poole, B., Pfau, D., Sohl-Dickstein, J.: Unrolled Generative Adversarial Networks. arXiv:1611.02163 (2016)

  20. Radford, A., Metz, L., Chintala, S.: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434 (2015)

  21. Ranzato, M., Boureau, Y., LeCun, Y.: Sparse feature learning for deep belief networks. In: Conference on Neural Information Processing Systems (2007)

  22. Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contractive auto-encoders: explicit invariance during feature extraction. In: Proceedings of the 28th International Conference on Machine Learning (2011)

  23. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, pp. 2234–2242 (2016)

  24. Upchurch, P., Gardner, J.R., Pleiss, G., Pless, R., Snavely, N., Bala, K., Weinberger, K.Q.: Deep feature interpolation for image content changes. In: CVPR, vol. 1, p. 3 (2017)

  25. Zhu, J.Y., Krähenbühl, P., Shechtman, E., Efros, A.A.: Generative visual manipulation on the natural image manifold. In: European Conference on Computer Vision (2016)


Acknowledgements

This work was funded by the Agence Nationale de la Recherche (ANR) (Grant No. ANR-14-CE27-0019 (MIRIAM)), in the MIRIAM project.

Author information

Corresponding author

Correspondence to Alasdair Newson.


Appendices

Appendix A: Contractive Encoders Learn the Area of Disks

We study the encoder part of an autoencoder that takes an image and outputs a one-dimensional feature. We show that, under the simple constraint that the output of the encoder is not constant, and with no loss other than the contractive loss, the learned feature is simply the area of the disk presented to the encoder.

We refer to the input image as \(x_r\), where r is the radius of the disk present in the image (one disk for each image, and each disk centered). In this simple setting, we will seek to find a function \(z:L^2(\varOmega )\rightarrow \mathbb {R}\) that stands for the encoder E, where \(\varOmega \) is the support of the images. The loss associated with a contractive autoencoder [22] is

$$\begin{aligned} \mathcal {L}(z)=\sum _{r=0}^{R_{\max }} \Vert \nabla z(x_r)\Vert ^2, \end{aligned}$$
(20)

where \(\nabla z \in L^2(\varOmega )\) stands for the gradient of the latent code z with respect to the input image, evaluated at the input \(x_r\) (it is itself an image). \(R_{\max }\) is the maximum radius observed in the dataset, which we normalize to 1. Although the argument of a loss is typically the set of parameters \(\theta \) of a network, written \(\mathcal {L}(\theta )\), here we minimize over all possible encoders z, simulating an encoder with infinite capacity; hence the notation \(\mathcal {L}(z)\).
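For concreteness, the contractive penalty of Eq. (20) can be estimated numerically. The sketch below is an illustration only, not the paper's implementation: it replaces backpropagation by central finite differences, and the "area" encoder passed in at the end is a toy stand-in for a trained network.

```python
import numpy as np

def contractive_loss(encoder, images, eps=1e-4):
    """Estimate sum_r ||grad_x z(x_r)||^2 (Eq. 20) by finite differences.

    `encoder` maps a 2-D image to a scalar code; the central-difference
    gradient below stands in for backpropagation through a real network.
    """
    total = 0.0
    for x in images:
        g = np.zeros_like(x)
        for idx in np.ndindex(x.shape):
            xp = x.copy(); xp[idx] += eps
            xm = x.copy(); xm[idx] -= eps
            g[idx] = (encoder(xp) - encoder(xm)) / (2 * eps)
        total += np.sum(g ** 2)
    return total

# Toy "area" encoder: z(x) = mean(x), so grad z = 1/n at every pixel
# and each image contributes n * (1/n)^2 = 1/n to the loss.
loss = contractive_loss(lambda x: x.mean(), [np.zeros((4, 4)), np.ones((4, 4))])
```

Note that the loss depends only on the gradient of the encoder, not on its output values, which is why the non-constancy constraint below is needed to rule out trivial solutions.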

We can take a continuous proxy for this loss and write

$$\begin{aligned} L(z)=\int _0^1 ||\nabla z(x_r)||_2^2 \mathrm{d}r, \end{aligned}$$
(21)

Note that the integration against the measure \(\mathrm{d}r\) reflects the fact that the distribution of the radii is uniform. In anticipation of the derivations ahead, we suppose that both the encoder function and the edges of the shapes are smooth; we will investigate what happens when the shapes become infinitely sharp afterwards. We can express this by

$$\begin{aligned} x_r(p_x,p_y)= & {} \varphi \left( \frac{\sqrt{p_x^2+p_y^2}-r}{\sigma }\right) \nonumber \\= & {} \varphi _\sigma \left( \sqrt{p_x^2+p_y^2}-r\right) , \end{aligned}$$
(22)

where \(p=(p_x,p_y)\) is a position, \(\varphi \) is a smooth real function equal to 1 before \(-1\) and to 0 after 1 (think of a simplified \(\text{ tanh }\) function), and \(\sigma \) is a scaling factor. When \(\sigma \) goes to zero, we are in the case of sharp edges. Other smooth representations of a disk are possible, for example, \(x_r(p_x,p_y)=(\mathbb {1}_{B_r}*g_{\sigma })(p_x,p_y)\), where \(\mathbb {1}_{B_r}\) is the indicator function of the ball of radius r, as used in our experiments; when \(\sigma \) goes to zero, we again recover sharp edges. We stick to the representation in Eq. (22) since it simplifies our calculations further on, in particular in Sect. A.1.
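A minimal numerical version of the profile in Eq. (22), assuming a clipped-linear \(\varphi \) (the text only requires a smooth function equal to 1 before \(-1\) and 0 after 1, so this particular choice is an illustrative assumption):

```python
import numpy as np

def smooth_disk(r, size=64, sigma=2.0):
    """Image x_r of Eq. (22): a disk of radius r with an edge of width ~sigma.

    phi here is a clipped-linear ramp: 1 for t <= -1, 0 for t >= 1,
    linear in between (a stand-in for the simplified tanh of the text).
    """
    coords = np.arange(size) - (size - 1) / 2.0
    px, py = np.meshgrid(coords, coords)
    t = (np.sqrt(px ** 2 + py ** 2) - r) / sigma
    return np.clip((1.0 - t) / 2.0, 0.0, 1.0)

x = smooth_disk(10.0)
```

As \(\sigma \) shrinks, the ramp sharpens and \(x_r\) tends to the indicator of the ball of radius r.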

To avoid trivial cases, we also require our encoder not to be constant.Footnote 2 Once scaled, this constraint can be written as

$$\begin{aligned} 1=z(x_1)-z(x_0)=\int _0^1\frac{\partial z}{\partial r} \mathrm{d}r=\int _0^1 \left\langle \nabla z|\frac{\partial x_r}{\partial r} \right\rangle \mathrm{d}r, \end{aligned}$$
(23)

the last equality being the chain rule. Let us denote

$$\begin{aligned} h_r(p):=\frac{\partial x_r}{\partial r}(p). \end{aligned}$$
(24)

Now our problem boils down to

$$\begin{aligned} \begin{array}{l@{\quad }l} \text{ Minimize } \text{: }&{}\displaystyle \int _0^1\Vert \nabla z(x_r)\Vert ^2\mathrm{d}r\\ \text{ Under } \text{ the } \text{ constraint } \text{: }&{}\displaystyle \int _0^1\left\langle \nabla z(x_r)|h_r\right\rangle \mathrm{d}r =1 \end{array} \end{aligned}$$
(25)

The minimization is performed over all functions z that are smooth enough to have a gradient with respect to their input x.

For a fixed r, among all \(\nabla z(x_r)\) satisfying

$$\begin{aligned} \left\langle \nabla z(x_r)|h_r\right\rangle =C(r) \end{aligned}$$

for some constant C(r), the one with minimal \(\Vert \nabla z(x_r)\Vert \) is of the form \(c(r)h_r\). To see this, write \(\nabla z(x_r)=\beta h_r+h_r^\perp \), a decomposition of \(\nabla z(x_r)\) on \(\mathrm {Vect}(h_r)\) and its orthogonal space (in \(L^2(\varOmega )\)). Hence, we can decrease the quantity to minimize in Eq. (25) without changing the constraint by projecting \(\nabla z(x_r)\) on \(\mathrm {Vect}(h_r)\). Thus, we can make the assumption that our solution z is such that

$$\begin{aligned} \nabla z(x_r)=c(r)h_r, \end{aligned}$$
(26)

and we are reduced to finding a single function c that satisfies

$$\begin{aligned} \begin{array}{l@{\quad }l} \text{ Minimize } \text{: }~&{}\displaystyle \int _0^1 c(r)^2H_2(r)\mathrm{d}r\\ \text{ Under } \text{ the } \text{ constraint } \text{: }~&{}\displaystyle \int _0^1 c(r)H_2(r)\mathrm{d}r =1, \end{array} \end{aligned}$$
(27)

where

$$\begin{aligned} H_2(r)=\iint _{}h_r(p_x,p_y)^2dp_xdp_y. \end{aligned}$$
(28)
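The projection step above (replacing \(\nabla z(x_r)\) by its component along \(h_r\)) can be sanity-checked in finite dimensions. In the sketch below, the random vectors are stand-ins for \(\nabla z(x_r)\) and \(h_r\):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=50)            # stand-in for h_r
g = rng.normal(size=50)            # any candidate gradient nabla z(x_r)
c_val = g @ h                      # the constraint value <g, h>

# Orthogonal projection of g onto Vect(h): it keeps the same inner
# product with h, while the h-orthogonal part (which only adds norm)
# is discarded.
g_proj = (c_val / (h @ h)) * h

assert np.isclose(g_proj @ h, c_val)
assert np.linalg.norm(g_proj) <= np.linalg.norm(g)
```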
Fig. 11

Absolute value of the gradient of the code z with respect to \(x_r\). We verify that in the case of a contractive encoder, the gradient of the code z of the disk image \(x_r\), with respect to the image itself, is indeed concentrated on a circle of radius r. This behavior is important to show that the contractive encoder indeed extracts the area of the disk

Let us consider a small perturbation of the solution, \(c+\epsilon \delta \), for some smooth function \(\delta \) which satisfies \(\int _0^1 \delta (r)H_2(r)\mathrm{d}r=0\) (the derivative of the constraint). Then, we have

$$\begin{aligned}&\frac{\mathrm{d}}{\mathrm{d}\epsilon } \left( \int _0^1 \left( c(r)+\epsilon \delta (r)\right) ^2H_2(r)\mathrm{d}r \right) \nonumber \\&\quad = \int _0^1 \left( 2 c(r) \delta (r) + 2 \epsilon \delta (r)^2\right) H_2(r) \mathrm{d}r. \end{aligned}$$
(29)

If we take the limit when \(\epsilon \rightarrow 0\), we have the condition

$$\begin{aligned} \int _0^1 c(r) \delta (r) H_2(r) \mathrm{d}r = 0. \end{aligned}$$
(30)

The solution of the system (27) is \(c(r)=C\) for C some constant, since the only function c(r) that satisfies Eq. (30) for any valid increment \(\delta \) is a constant one. Indeed, we have two conditions \(\delta \in \mathrm {Vect}(H_2)^\perp \) and \(\left\langle c H_2, \delta \right\rangle =0\). This means that \(c H_2 \in \left( \mathrm {Vect}(H_2)^\perp \right) ^\perp = \mathrm {Vect}(H_2)\).

Finally, when the edge width \(\sigma \) goes to zero, the function \(h_r\) concentrates on a circle of radius r (see Sect. A.1), with a value that is almost constant over the range of r. Roughly speaking, this gives

$$\begin{aligned} H_2(r)=2\pi r\alpha . \end{aligned}$$
(31)

For the sake of completeness, we have verified experimentally that the function \(h_r\) is indeed concentrated on a circle of radius r. These results can be seen in Fig. 11.

Finally, by integrating, we have

$$\begin{aligned} z(r)=\int _0^r \frac{\mathrm{d}z}{\mathrm{d}\rho } \mathrm{d}\rho =\int _0^r c(\rho )H_2(\rho )\mathrm{d}\rho =\gamma r^2, \end{aligned}$$
(32)

where \(\gamma \) is some constant.
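Eq. (32) predicts a quadratic code, i.e., one proportional to the disk area. As a quick numerical check (an illustration using a discretized sharp disk, not the trained network), the area code indeed scales like \(r^2\):

```python
import numpy as np

def area_code(r, size=128):
    """Code predicted by Eq. (32), up to a constant: the disk area."""
    coords = np.arange(size) - (size - 1) / 2.0
    px, py = np.meshgrid(coords, coords)
    x_r = (px ** 2 + py ** 2 <= r ** 2).astype(float)  # sharp disk image
    return x_r.sum()  # approximately pi * r**2

# Doubling the radius should multiply the code by about 4 (z = gamma r^2).
ratio = area_code(20.0) / area_code(10.0)
```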

Fig. 12

Output of an autoencoder trained on disks, applied to squares and non-centered disks during testing. The autoencoder indeed extracts the area of the object, regardless of its shape or position. Since it was trained on disks, it outputs the disks with a similar area to the objects observed during testing. Note that the autoencoder does not extract position, since it was trained on centered disks

1.1 A.1 Infinitely Thin Edges

Here, we show our claim that, when the edge width goes to 0, z(r) is indeed proportional to the disk area (Eq. (31)). We do this with the model described in Eq. (22):

$$\begin{aligned} x_r(p_x,p_y)= & {} \varphi \left( \frac{\sqrt{p_x^2+p_y^2}-r}{\sigma }\right) \nonumber \\= & {} \varphi _\sigma \left( \sqrt{p_x^2+p_y^2}-r\right) , \end{aligned}$$
(33)

where \(\varphi \) is some smooth function that is equal to 1 before \(-1\) and to 0 after 1.

In this case, we have

$$\begin{aligned} \frac{\partial x_r}{\partial r}(p_x,p_y)=-\varphi '_\sigma (\sqrt{p_x^2+p_y^2}-r). \end{aligned}$$
(34)

The support of \(\varphi '_\sigma \) is \([-\sigma ,\sigma ]\). This function is radial and we are interested in computing (28), which gives

$$\begin{aligned} H_2(r)=\int _{r-\sigma }^{r+\sigma }2\pi u\left( \varphi '_\sigma (u-r)\right) ^2du \end{aligned}$$
(35)

with the variable u being \(\sqrt{p_x^2+p_y^2}\).

For \(r\ge \sigma \), we have the following simple inequalities

$$\begin{aligned} 2\pi (r-\sigma )\sigma ^2C\le H_2(r)\le 2\pi (r+\sigma ) \sigma ^2C \end{aligned}$$
(36)

where \(C=\int \varphi '^2(t)dt\). This confirms the behavior of \(H_2(r)\) as being merely proportional to r.

More precisely, we obtain (by integration as in (32))

$$\begin{aligned} \gamma (r^2-\sigma r)\le z(r)\le \gamma (r^2+\sigma r), \end{aligned}$$
(37)

which is the announced behavior for z. \(\square \)
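The linear growth of \(H_2(r)\) established above can be verified numerically with a finite-difference \(h_r\). The clipped-linear edge profile below is an assumed stand-in for \(\varphi \); image size and radii are illustrative choices:

```python
import numpy as np

def smooth_disk(r, size=128, sigma=2.0):
    """Smooth disk x_r of Eq. (33) with a clipped-linear edge profile."""
    coords = np.arange(size) - (size - 1) / 2.0
    px, py = np.meshgrid(coords, coords)
    t = (np.sqrt(px ** 2 + py ** 2) - r) / sigma
    return np.clip((1.0 - t) / 2.0, 0.0, 1.0)

def H2(r, dr=1e-3):
    """Eq. (28): squared L2 norm of h_r = dx_r/dr, by finite differences."""
    h = (smooth_disk(r + dr) - smooth_disk(r)) / dr
    return np.sum(h ** 2)

# Eq. (36) bounds H2(r) between linear functions of r: doubling the
# radius should (approximately) double H2.
ratio = H2(16.0) / H2(8.0)
```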

1.2 A.2 Experimental Results

To further test this behavior experimentally, we have used our contractive autoencoder trained on disks, and applied it to a test set of images with squares and non-centered disks. In Fig. 12, it can be seen that the encoder indeed extracts the area of these objects and then outputs the disk with the closest area (since it has been trained on a disk database). This further confirms that the encoder is indeed extracting the area.

Appendix B: Creating the Disk Dataset

We wish to create a dataset which contains images of centered disks. Since the autoencoder must project each image to a continuous scalar, it makes sense to generate the disks with a continuous parameter r, and that the disks also be “continuous” in some sense (each different value of r should produce a different disk). For this, as we mentioned in Sect. 4.1, we create the training images \(x_r\) as

$$\begin{aligned} x_r = g_{\sigma } *\mathbb {1}_{\mathbb {B}_r}, \end{aligned}$$
(38)

where \(\mathbb {1}_{\mathbb {B}_r}\) is the indicator function of the ball of radius r and \(g_{\sigma }\) is a Gaussian kernel with variance \(\sigma \). In practical terms, we carry this out using a Monte Carlo simulation to approximate the convolution of the indicator function with the Gaussian kernel. Indeed, let \((\xi _i)_{i=1 \ldots N}\) be a sequence of independent and identically distributed (iid) random variables, with \(\xi _i \sim \mathcal {N}\left( 0, \sigma \right) \). Each pixel at position t is evaluated as

$$\begin{aligned} x_r(t) = \frac{1}{N} \sum _{i=1}^{N} \mathbb {1}_{B_r}(t - \xi _i). \end{aligned}$$
(39)

According to the law of large numbers, as N grows this tends to the exact value of \((g_{\sigma } *\mathbb {1}_{\mathbb {B}_r})(t)\), giving a method of producing a continuous dataset.

While other approaches are available (evaluating the convolution in the Fourier domain, for example), this is simple to implement and generalizes to any shape which we can parametrize. We also note that the large majority of deep learning synthesis papers suppose that the data lie on some manifold, but this hypothesis is never checked. In our case, we explicitly sample the data in a smooth space.
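A minimal sketch of the Monte Carlo construction of Eqs. (38)–(39); the image size, sample count, and radius are illustrative choices, not the paper's settings:

```python
import numpy as np

def disk_image(r, size=32, sigma=1.0, n_samples=2000, seed=0):
    """Monte Carlo estimate of (g_sigma * 1_{B_r})(t) for every pixel t.

    Each pixel value is the fraction of Gaussian offsets xi_i for which
    the shifted point t - xi_i falls inside the ball of radius r (Eq. 39).
    """
    rng = np.random.default_rng(seed)
    xi = rng.normal(0.0, sigma, size=(n_samples, 2))  # iid Gaussian offsets
    coords = np.arange(size) - (size - 1) / 2.0
    px, py = np.meshgrid(coords, coords)
    inside = ((px[None] - xi[:, 0, None, None]) ** 2
              + (py[None] - xi[:, 1, None, None]) ** 2) <= r ** 2
    return inside.mean(axis=0)

x = disk_image(10.0)
```

Increasing `n_samples` smooths the estimate towards the exact convolution, and the same sampler extends to any parametrized shape by swapping the indicator test.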

Fig. 13

Verification of the theoretical derivations that use the hypothesis that \(y(t,r) = h(r) f(t)\) for decoding, in the case where the autoencoder contains no bias. We have plotted z against the theoretically optimal value of h (\(C\left\langle f,\mathbb {1}_{B_r}\right\rangle \), where C is some constant accounting for the arbitrary normalization of f). This experimental sanity check confirms our theoretical derivations

Fig. 14

Comparison of the empirical function f of the autoencoder without biases with the numerical minimization of Eq. (7). We have determined the empirical function f of the autoencoder and compared it with the minimization of Eq. (7). The resulting profiles are similar, showing that the autoencoder indeed succeeds in minimizing this energy

Appendix C: Decoding of a Disk (Network with No Biases)

During the training of the autoencoder for the case of disks (with no bias in the autoencoder), the objective of the decoder is to convert a scalar into the image of a disk, with the \(\ell _2\) distance as a metric. Given the profiles of the output of the autoencoder, we have made the hypothesis that the decoder approximates a disk of radius r with a function \(y(t;r):= D(E(\mathbb {1}_{B_r})) = h(r) f(t)\), where f is a continuous function. We verify this hypothesis experimentally in Fig. 13: we estimate f by averaging all output profiles, and then compare our code z against its theoretically optimal value \(\left\langle f,\mathbb {1}_{B_r}\right\rangle \). The two agree up to a multiplicative constant C.

We now compare the numerical optimization of the energy in Eq. (7) using a gradient descent approach with the profile obtained by the autoencoder without biases. The resulting comparison can be seen in Fig. 14. One can also derive a closed-form solution of Eq. (7) by means of the Euler-Lagrange equation: the optimal f for Eq. (7) is the solution of the differential equation \(y^{\prime \prime }=-kty\) with initial state \((y,y^\prime )=(1,0)\), where k is a free positive constant that accounts for the position of the first zero of y. This gives a closed form for f in terms of Airy functions.
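The differential equation can also be integrated directly. The sketch below uses a hand-rolled classical RK4 step (an illustrative alternative to the gradient descent of the text) with the arbitrary value k = 1:

```python
import numpy as np

def decoder_profile(k=1.0, t_max=5.0, n=5000):
    """Integrate y'' = -k t y with (y, y')(0) = (1, 0) by classical RK4.

    The change of variable s = -k**(1/3) * t maps this to the Airy
    equation y'' = s y, so y is a combination of Ai and Bi; k is the
    free constant of the text (k = 1 is an arbitrary illustrative value).
    """
    dt = t_max / (n - 1)
    t = np.linspace(0.0, t_max, n)
    y = np.empty(n)
    state = np.array([1.0, 0.0])  # (y, y')

    def f(ti, s):
        return np.array([s[1], -k * ti * s[0]])

    for i, ti in enumerate(t):
        y[i] = state[0]
        k1 = f(ti, state)
        k2 = f(ti + dt / 2, state + dt / 2 * k1)
        k3 = f(ti + dt / 2, state + dt / 2 * k2)
        k4 = f(ti + dt, state + dt * k3)
        state = state + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return t, y

t, y = decoder_profile()
```

For t > 0 the solution oscillates (Airy functions with negative argument), so y has a first zero whose location moves with k; tuning k places this zero at the disk radius.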

Fig. 15

Profile of the encoding/decoding of centered disks, with a restricted database. The decoder learns a profile f which only extends to the largest observed radius \(R=18\). Beyond this radius, another profile is learned that is obviously not tuned to any data

Fig. 16

Output of the DCGAN of Radford et al. [20] (“IGAN”) for disks when the database is missing disks of certain radii (11–18 pixels). We can see that the DCGAN is not capable of reconstructing the disks which were not observed in the training dataset. This is a clear problem for generalization. In the second plot, we zoom in on the data points around the radius zone which is unobserved in the training dataset

Appendix D: Autoencoding Disks with a Database with a Limited Observed Radius (Network with no Biases)

In Fig. 15, we see the gray levels of the input/output of an autoencoder trained (without biases) on a restricted database, that is to say, a database whose disks have a maximum radius R smaller than the image width. We used \(R=18\) for these experiments. We see that the decoder learns a useful profile f which only extends to this maximum radius. Beyond this radius, another profile is used, corresponding to the other sign of the codes (see Proposition 2), which is not tuned to any data.

Appendix E: Autoencoding Disks with a DCGAN [20]

In Fig. 16, we show the autoencoding results of the DCGAN network of Radford et al. We trained their network with a code size of \(d=1\). As can be seen, the DCGAN learns to force the training data to a predefined distribution, which cannot be modified during training (contrary to the autoencoder). Thus, the network fails to correctly autoencode disks in the missing radius region which has not been observed in the training database.

About this article


Cite this article

Newson, A., Almansa, A., Gousseau, Y. et al. Processing Simple Geometric Attributes with Autoencoders. J Math Imaging Vis 62, 293–312 (2020). https://doi.org/10.1007/s10851-019-00924-w

