Abstract
We describe generalizations of the universal approximation theorem for neural networks to maps invariant or equivariant with respect to linear representations of groups. Our goal is to establish network-like computational models that are both invariant/equivariant and provably complete in the sense of their ability to approximate any continuous invariant/equivariant map. Our contribution is threefold. First, in the general case of compact groups we propose a construction of a complete invariant/equivariant network using an intermediate polynomial layer. We invoke classical theorems of Hilbert and Weyl to justify and simplify this construction; in particular, we describe an explicit complete ansatz for the approximation of permutation-invariant maps. Second, we consider groups of translations and prove several versions of the universal approximation theorem for convolutional networks in the limit of continuous signals on Euclidean spaces. Finally, we consider 2D signal transformations equivariant with respect to the group SE(2) of rigid Euclidean motions. In this case we introduce the “charge-conserving convnet”, a convnet-like computational model based on the decomposition of the feature space into isotypic representations of SO(2). We prove this model to be a universal approximator for continuous SE(2)-equivariant signal transformations.
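The complete ansatz for permutation-invariant maps mentioned above rests on the classical fact that the power sums \(p_k({\mathbf {x}})=\sum _i x_i^k\), \(k=1,\ldots ,n\), generate all symmetric polynomials in \(x_1,\ldots ,x_n\). A minimal runnable sketch of this style of construction is below; the readout `f` is a hypothetical stand-in for a trained network, not the paper's specific construction.

```python
import numpy as np

def power_sums(x):
    """Complete set of permutation invariants of x in R^n: p_k = sum_i x_i^k, k=1..n."""
    n = len(x)
    return np.array([np.sum(x**k) for k in range(1, n + 1)])

def invariant_net(x, readout):
    # readout: any continuous map R^n -> R (e.g., an approximating MLP);
    # composing it with the power sums yields a permutation-invariant map.
    return readout(power_sums(x))

x = np.array([0.2, -1.0, 3.0, 0.5])
f = lambda s: np.tanh(s).sum()          # hypothetical stand-in for a trained readout
y1 = invariant_net(x, f)
y2 = invariant_net(np.random.permutation(x), f)
assert np.isclose(y1, y2)               # invariant under permutations by construction
```

Since the power sums separate orbits of the permutation group, any continuous permutation-invariant function can be approximated by choosing a suitable readout.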
Notes
Another approach to ensuring a well-defined value \({\varvec{\Phi }}({\mathbf {x}})\) is to work with shift-invariant reproducing kernel Hilbert spaces (RKHS) instead of \(L^2\) spaces. The definition of an RKHS requires the signal evaluation \({\varvec{\Phi }}\mapsto {\varvec{\Phi }}({\mathbf {x}})\) to be continuous in \({\varvec{\Phi }}\), and in particular well defined. An example of a shift-invariant RKHS is the space of band-limited signals with a particular bandwidth. We thank the anonymous reviewer for pointing out this approach.
References
Anselmi, F., Rosasco, L., Poggio, T.: On invariance and selectivity in representation learning. Inf. Inference 5(2), 134–158 (2016)
Bruna, J., Mallat, S.: Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1872–1886 (2013)
Burkhardt, H., Siggelkow, S.: Invariant features in pattern recognition: fundamentals and applications. In: Nonlinear Model-Based Image/Video Processing and Analysis, pp. 269–307 (2001)
Cohen, N., Shashua, A.: Convolutional rectifier networks as generalized tensor decompositions. In: International Conference on Machine Learning, pp. 955–963 (2016)
Cohen, N., Sharir, O., Levine, Y., Tamari, R., Yakira, D., Shashua, A.: Analysis and design of convolutional networks via hierarchical tensor decompositions (2017). arXiv preprint arXiv:1705.02302
Cohen, T., Welling, M.: Group equivariant convolutional networks. In: Proceedings of the 33rd International Conference on Machine Learning, pp. 2990–2999 (2016)
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989)
Dieleman, S., De Fauw, J., Kavukcuoglu, K.: Exploiting cyclic symmetry in convolutional neural networks. In: Proceedings of the 33rd International Conference on International Conference on Machine Learning, vol. 48, pp. 1889–1898 (2016)
Esteves, C., Allen-Blanchette, C., Zhou, X., Daniilidis, K.: Polar transformer networks. In: International Conference on Learning Representations (2018)
Funahashi, K.-I.: On the approximate realization of continuous mappings by neural networks. Neural Netw. 2(3), 183–192 (1989)
Gens, R., Domingos, P.M.: Deep symmetry networks. In: Advances in Neural Information Processing Systems, pp. 2537–2545 (2014)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hedlund, G.A.: Endomorphisms and automorphisms of the shift dynamical system. Theory Comput. Syst. 3(4), 320–375 (1969)
Henriques, J.F., Vedaldi, A.: Warped convolutions: efficient invariance to spatial transformations. In: International Conference on Machine Learning, pp. 1461–1469 (2017)
Hilbert, D.: Über die Theorie der algebraischen Formen. Mathematische Annalen 36(4), 473–534 (1890)
Hilbert, D.: Über die vollen Invariantensysteme. Mathematische Annalen 42(3), 313–373 (1893)
Hornik, K.: Some new results on neural network approximation. Neural Netw. 6(8), 1069–1072 (1993)
Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989)
Kondor, R., Trivedi, S.: On the generalization of equivariance and convolution in neural networks to the action of compact groups. In: International Conference on Machine Learning, pp. 2747–2755 (2018)
Kraft, H., Procesi, C.: Classical invariant theory, a primer. Lecture Notes (2000)
LeCun, Y.: Generalization and network design strategies. In: Connectionism in Perspective, pp. 143–155 (1989)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
Leshno, M., Lin, V.Y., Pinkus, A., Schocken, S.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 6(6), 861–867 (1993)
Mallat, S.: Group invariant scattering. Commun. Pure Appl. Math. 65(10), 1331–1398 (2012)
Mallat, S.: Understanding deep convolutional networks. Philos. Trans. R. Soc. A 374(2065), 20150203 (2016)
Manay, S., Cremers, D., Hong, B.-W., Yezzi, A.J., Soatto, S.: Integral invariants for shape matching. IEEE Trans. Pattern Anal. Mach. Intell. 28(10), 1602–1618 (2006)
Marcos, D., Volpi, M., Komodakis, N., Tuia, D.: Rotation equivariant vector field networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5048–5057 (2017)
Mhaskar, H.N., Micchelli, C.A.: Approximation by superposition of sigmoidal and radial basis functions. Adv. Appl. Math. 13(3), 350–373 (1992)
Munkres, J.R.: Topology. Featured Titles for Topology Series. Prentice Hall, Upper Saddle River (2000)
Pinkus, A.: TDI-subspaces of \(C({\mathbb{R}}^d)\) and some density problems from neural networks. J. Approx. Theory 85(3), 269–287 (1996)
Pinkus, A.: Approximation theory of the MLP model in neural networks. Acta Numerica 8, 143–195 (1999)
Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., Liao, Q.: Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. Int. J. Autom. Comput. 1–17 (2017)
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660 (2017)
Reisert, M.: Group integration techniques in pattern analysis. Ph.D. thesis, Albert-Ludwigs-University (2008)
Schmid, B.J.: Finite groups and invariant theory. In: Topics in Invariant Theory, pp. 35–66. Springer (1991)
Schulz-Mirbach, H.: Invariant features for gray scale images. In: Mustererkennung 1995, pp. 1–14. Springer (1995)
Serre, J.-P.: Linear Representations of Finite Groups, vol. 42. Springer, Berlin (2012)
Sifre, L., Mallat, S.: Rigid-motion scattering for texture classification (2014). arXiv preprint arXiv:1403.1687
Simon, B.: Representations of Finite and Compact Groups. Graduate Studies in Mathematics, vol. 10. American Mathematical Society, Providence (1996)
Skibbe, H.: Spherical tensor algebra for biomedical image analysis. Ph.D. thesis, Albert-Ludwigs-University (2013)
Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: the all convolutional net (2014). arXiv preprint arXiv:1412.6806
Thoma, M.: Analysis and optimization of convolutional neural network architectures. Master’s thesis, Karlsruhe Institute of Technology, Karlsruhe, Germany, June 2017. https://martin-thoma.com/msthesis/
Vinberg, E.B.: Linear Representations of Groups. Birkhäuser, Basel (2012)
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 37(3), 328–339 (1989)
Weyl, H.: The Classical Groups: Their Invariants and Representations. Princeton Mathematical Series, vol. 1. Princeton University Press, Princeton (1946)
Worfolk, P.A.: Zeros of equivariant vector fields: algorithms for an invariant approach. J. Symb. Comput. 17(6), 487–511 (1994)
Worrall, D.E., Garbin, S.J., Turmukhambetov, D., Brostow, G.J.: Harmonic networks: deep translation and rotation equivariance. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5028–5037 (2017)
Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.R., Smola, A.J.: Deep sets. In: Advances in Neural Information Processing Systems, pp. 3391–3401 (2017)
Acknowledgements
The author thanks the anonymous reviewer for several helpful suggestions.
Communicated by Wolfgang Dahmen, Ronald A. Devore, and Philipp Grohs.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Proof of Lemma 4.1
The proof is a slight modification of the standard proof of the central limit theorem (CLT) via the Fourier transform (the CLT can be directly used to prove the lemma in the case \(a=b=0\), when \({\mathcal {L}}_{\lambda }^{(a,b)}\) only includes diffusion factors).
To simplify notation, assume without loss of generality that \(d_V=1\) (in the general case the proof is essentially identical). We will use the appropriately discretized version of the Fourier transform (i.e., the Fourier series expansion). Given a discretized signal \( \Phi :(\lambda {\mathbb {Z}})^2\rightarrow {\mathbb {C}}\), we define \({\mathcal {F}}_\lambda \Phi \) as a function on \([-\frac{\pi }{\lambda },\frac{\pi }{\lambda }]^2\) by
Then, \({\mathcal {F}}_\lambda : L^2((\lambda {\mathbb {Z}})^2,{\mathbb {C}})\rightarrow L^2([-\frac{\pi }{\lambda },\frac{\pi }{\lambda }]^2,{\mathbb {C}})\) is a unitary isomorphism, assuming that the scalar product in the input space is defined by \(\langle \Phi ,\Psi \rangle =\lambda ^2\sum _{\gamma \in (\lambda {\mathbb {Z}})^2}\overline{\Phi (\gamma )}\Psi (\gamma )\) and in the output space by \(\langle \Phi ,\Psi \rangle =\int _{[-\frac{\pi }{\lambda },\frac{\pi }{\lambda }]^2} \overline{\Phi ({\mathbf {p}})}\Psi ({\mathbf {p}}) \mathrm{d}^2{\mathbf {p}}\). Let \(P_\lambda \) be the discretization projector (3.6). It is easy to check that \({\mathcal {F}}_\lambda P_\lambda \) strongly converges to the standard Fourier transform as \(\lambda \rightarrow 0:\)
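As a quick numerical sanity check of this strong convergence, one can compare \({\mathcal {F}}_\lambda P_\lambda \Phi \) with the continuum Fourier transform on a sampled Gaussian. The normalization used below, \({\mathcal {F}}_\lambda \Phi ({\mathbf {p}})=\frac{\lambda ^2}{2\pi }\sum _{\gamma \in (\lambda {\mathbb {Z}})^2}e^{-i{\mathbf {p}}\cdot \gamma }\Phi (\gamma )\), is an assumption chosen to be unitary for the scalar products stated above (the paper's displayed definition may differ by a convention):

```python
import numpy as np

# Sampled 2D Gaussian on a truncated grid (lam*Z)^2
lam = 0.1
xs = lam * np.arange(-60, 61)
X, Y = np.meshgrid(xs, xs, indexing="ij")
Phi = np.exp(-(X**2 + Y**2) / 2)

def F_lam(px, py):
    """Discretized Fourier transform at momentum (px, py); assumed normalization
    F_lam Phi(p) = lam^2/(2*pi) * sum_gamma exp(-i p.gamma) Phi(gamma)."""
    return lam**2 / (2 * np.pi) * np.sum(np.exp(-1j * (px * X + py * Y)) * Phi)

# The unitary continuum Fourier transform of this Gaussian is exp(-|p|^2/2),
# so the discretization error should be small for small lam.
px, py = 0.5, -0.3
exact = np.exp(-(px**2 + py**2) / 2)
print(abs(F_lam(px, py) - exact))   # small Riemann-sum error for lam = 0.1
```

The sum is just a Riemann sum for the continuum integral, which is why \({\mathcal {F}}_\lambda P_\lambda \) converges strongly to the Fourier transform as \(\lambda \rightarrow 0\) on rapidly decaying signals.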
where
and where we naturally embed \(L^2([-\frac{\pi }{\lambda },\frac{\pi }{\lambda }]^2,{\mathbb {C}})\subset L^2({\mathbb {R}}^2,{\mathbb {C}})\). Conversely, let \(P_\lambda '\) denote the orthogonal projection onto the subspace \(L^2([-\frac{\pi }{\lambda },\frac{\pi }{\lambda }]^2,{\mathbb {C}})\) in \(L^2({\mathbb {R}}^2,{\mathbb {C}}):\)
Then
The Fourier transform gives us the spectral representation of the discrete differential operators (4.15), (4.16), (4.17) as operators of multiplication by a function:
where, denoting \({\mathbf {p}}=(p_x,p_y)\),
The operator \({\mathcal {L}}_\lambda ^{(a,b)}\) defined in (4.19) can then be written as
where the function \(\Psi _{{\mathcal {L}}_\lambda ^{(a,b)}}\) is given by
We can then write \({\mathcal {L}}_\lambda ^{(a,b)}\Phi \) as a convolution of \(P_\lambda \Phi \) with the kernel
on the grid \((\lambda {\mathbb {Z}})^2:\)
Now consider the operator \({\mathcal {L}}_0^{(a,b)}\) defined in (4.21). At each \({\mathbf {x}}\in {\mathbb {R}}^2\), the value \({\mathcal {L}}_0^{(a,b)} \Phi ({\mathbf {x}})\) can be written as a scalar product:
where \({\widetilde{\Phi }}({\mathbf {x}})={\overline{\Phi }}(-{\mathbf {x}})\), \(\Psi _{a,b}\) is defined by (4.20), and \(R_{{\mathbf {x}}}\) is our standard representation of the group \({\mathbb {R}}^2\), \(R_{{\mathbf {x}}}\Phi ({\mathbf {y}})=\Phi ({\mathbf {y}}-{\mathbf {x}})\). For \(\lambda >0\), we can write \({\mathcal {L}}_\lambda ^{(a,b)} \Phi ({\mathbf {x}})\) in a similar form. Indeed, using (A.3) and naturally extending the discretized signal \(\Psi _{a,b}^{(\lambda )}\) to the whole \({\mathbb {R}}^2\), we have
Then, for any \({\mathbf {x}}\in {\mathbb {R}}^2\) we can write
where \(-{\mathbf {x}}+\delta {\mathbf {x}}\) is the point of the grid \((\lambda {\mathbb {Z}})^2\) nearest to \(-{\mathbf {x}}\).
Now consider the formulas (A.4), (A.5) and observe that, by the Cauchy–Schwarz inequality and since \(R\) is norm-preserving, to prove statement 1) of the lemma we only need to show that the functions \(\Psi _{a,b},\Psi _{a,b}^{(\lambda )}\) have uniformly bounded \(L^2\)-norms. For \(\lambda >0\) we have
where we used the inequalities
Expression (A.6) provides a finite bound, uniform in \(\lambda \), for the squared norms \(\Vert \Psi _{a,b}^{(\lambda )}\Vert ^2\). This bound also holds for \(\Vert \Psi _{a,b}\Vert ^2\).
Next, observe that to establish the strong convergence in statement 2) of the lemma, it suffices to show that
Indeed, by (A.4), (A.5), we would then have
thanks to the unitarity of R, convergence \(\lim _{\delta {\mathbf {x}}\rightarrow 0}\Vert R_{\delta {\mathbf {x}}}{\widetilde{\Phi }}-{\widetilde{\Phi }}\Vert _2=0,\) uniform boundedness of \(\Vert \Psi _{a,b}^{(\lambda )}\Vert _2\) and convergence (A.7).
To establish (A.7), we write
where \(\Psi _{{\mathcal {L}}_0^{(a,b)}}=2\pi {\mathcal {F}}_\lambda \Psi _{a,b}.\) By definition (4.20) of \(\Psi _{a,b}\) and standard properties of Fourier transform, the explicit form of the function \(\Psi _{{\mathcal {L}}_0^{(a,b)}}\) is
Observe that the function \(\Psi _{{\mathcal {L}}_0^{(a,b)}}\) is the pointwise limit of the functions \(\Psi _{{\mathcal {L}}_\lambda ^{(a,b)}}\) as \(\lambda \rightarrow 0\). The functions \(|\Psi _{{\mathcal {L}}_\lambda ^{(a,b)}}|^2\) are bounded uniformly in \(\lambda \) by the integrable function appearing in the integral (A.6). Therefore we can use the dominated convergence theorem and conclude that
where \(P_\lambda '\) is the cut-off projector (A.1). We then have
by (A.8) and (A.2). We have thus proved (A.7).
It remains to show that the convergence \({\mathcal {L}}_\lambda ^{(a,b)} \Phi \rightarrow {\mathcal {L}}_0^{(a,b)} \Phi \) is uniform on compact sets \(K\subset V\). This follows by a standard compactness argument. For any \(\epsilon >0\), by compactness of \(K\) we can choose finitely many \(\Phi _n,n=1,\ldots ,N\) (an \(\epsilon \)-net), such that for any \(\Phi \in K\) there is some \(\Phi _n\) for which \(\Vert \Phi -\Phi _n\Vert <\epsilon .\) Then \(\Vert {\mathcal {L}}_\lambda ^{(a,b)} \Phi - {\mathcal {L}}_0^{(a,b)} \Phi \Vert \le \Vert {\mathcal {L}}_\lambda ^{(a,b)} \Phi _n- {\mathcal {L}}_0^{(a,b)} \Phi _n\Vert +2\sup _{\lambda \ge 0} \Vert {\mathcal {L}}_\lambda ^{(a,b)}\Vert \epsilon \). Since \(\sup _{\lambda \ge 0} \Vert {\mathcal {L}}_\lambda ^{(a,b)}\Vert <\infty \) by statement 1) of the lemma, the desired uniform convergence for \(\Phi \in K\) follows from the convergence for \(\Phi _n,n=1,\ldots ,N\).
Cite this article
Yarotsky, D. Universal Approximations of Invariant Maps by Neural Networks. Constr Approx 55, 407–474 (2022). https://doi.org/10.1007/s00365-021-09546-1