
Nonlinear Approximation and (Deep) \(\mathrm {ReLU}\) Networks


Abstract

This article is concerned with the approximation and expressive powers of deep neural networks. This is an active research area currently producing many interesting papers. The results most commonly found in the literature prove that neural networks approximate functions with classical smoothness to the same accuracy as classical linear methods of approximation, e.g., approximation by polynomials or by piecewise polynomials on prescribed partitions. However, approximation by neural networks depending on n parameters is a form of nonlinear approximation and as such should be compared with other nonlinear methods such as variable knot splines or n-term approximation from dictionaries. The performance of neural networks in targeted applications such as machine learning indicates that they actually possess even greater approximation power than these traditional methods of nonlinear approximation. The main results of this article prove that this is indeed the case. This is done by exhibiting large classes of functions which can be efficiently captured by neural networks where classical nonlinear methods fall short of the task. The present article purposefully limits itself to studying the approximation of univariate functions by ReLU networks. Many generalizations to functions of several variables and other activation functions can be envisioned. However, even in this simplest of settings, a theory that completely quantifies the approximation power of neural networks is still lacking.



Notes

  1. By expressivity of a neural network, we mean the collection of functions the network outputs.

  2. Technically, the special networks differ from the usual ReLU networks because they contain ReLU-free neurons, but the set of functions \(\overline{{\underline{\Upsilon }}}^{W,L}\) produced by them is always contained in the standard ReLU network output \( \Upsilon ^{W,L}\), see Remark 3.1.

References

  1. Allaart, P., Kawamura, K.: The Takagi function: a survey. Real Anal. Exchange 37(1), 1–54 (2011–2012)

  2. Bölcskei, H., Grohs, P., Kutyniok, G., Petersen, P.: Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci. 1(1), 8–45 (2019)

  3. Bronstein, M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric deep learning: going beyond Euclidean data. IEEE Signal Process. Mag. 34(4), 18–42 (2017)

  4. Chui, C., Li, X., Mhaskar, H.: Neural networks for localized approximation. Math. Comput. 63, 607–623 (1994)

  5. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. (MCSS) 2(4), 303–314 (1989)

  6. Daniely, A.: Depth separation for neural networks. Proc. Mach. Learn. Res. (COLT) 65, 690–696 (2017)

  7. DeVore, R.: Nonlinear approximation. Acta Numer. 7, 51–150 (1998)

  8. DeVore, R., Howard, R., Micchelli, C.: Optimal non-linear approximation. Manuscripta Math. 63, 469–478 (1989)

  9. DeVore, R., Kyriazis, G., Leviatan, D., Tikhomirov, V.M.: Wavelet compression and nonlinear n-widths. Adv. Comput. Math. 1, 197–214 (1993)

  10. DeVore, R., Lorentz, G.: Constructive Approximation. Springer, Berlin (1993)

  11. E, W., Wang, Q.: Exponential convergence of the deep neural network approximation for analytic functions. Sci. China Math. 61, 1733–1740 (2018)

  12. Elbrächter, D., Perekrestenko, D., Grohs, P., Bölcskei, H.: Deep neural network approximation theory. IEEE Trans. Inf. Theory (2019). arXiv:1901.02220

  13. Hanin, B., Sellke, M.: Approximating continuous functions by ReLU nets of minimal width (2017). arXiv:1710.11278

  14. Hata, M.: Fractals in mathematics. In: Patterns and Waves: Qualitative Analysis of Nonlinear Differential Equations, pp. 259–278. Elsevier, Amsterdam (1986)

  15. Hebb, D.: The Organization of Behavior: A Neuropsychological Theory. Wiley, Hoboken (1949)

  16. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989)

  17. Lu, J., Shen, Z., Yang, H., Zhang, S.: Deep network approximation for smooth functions (2020). arXiv:2001.03040

  18. Kainen, P., Kurkova, V., Vogt, A.: Approximation by neural networks is not continuous. Neurocomputing 29, 47–56 (1999)

  19. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. NIPS (2012)

  20. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)

  21. Liang, S., Srikant, R.: Why deep neural networks for function approximation? (2016). arXiv:1610.04161

  22. Mehrabi, M., Tchamkerten, A., Yousefi, M.: Bounds on the approximation power of feedforward neural networks. ICML (2018)

  23. Mhaskar, H.N., Poggio, T.: Deep vs. shallow networks: an approximation theory perspective. Anal. Appl. 14, 829–848 (2016)

  24. Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., Liao, Q.: Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. Int. J. Autom. Comput. 14(5), 503–519 (2017)

  25. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65(6), 386 (1958)

  26. Safran, I., Shamir, O.: Depth-width tradeoffs in approximating natural functions with neural networks. In: International Conference on Machine Learning (ICML), PMLR (2017)

  27. Schwab, C., Zech, J.: Deep learning in high dimension. www.sam.math.ethz.ch/sam-reports/reports-final/reports2017/2017-57-rev2.pdf

  28. Shaham, U., Cloninger, A., Coifman, R.: Provable approximation properties for deep neural networks. Appl. Comput. Harmon. Anal. 44(3), 537–557 (2018)

  29. Shen, Z., Yang, H., Zhang, S.: Deep network approximation characterized by number of neurons (2019). arXiv:1906.05497

  30. Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)

  31. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354 (2017)

  32. Stein, E., Shakarchi, R.: Fourier Analysis. Princeton Lectures in Analysis, vol. 1. Princeton University Press, Princeton (2003)

  33. Telgarsky, M.: Representation benefits of deep feedforward networks (2015). arXiv:1509.08101

  34. Wu, Y., Schuster, M., Chen, Z., Le, Q., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al.: Google's neural machine translation system: bridging the gap between human and machine translation (2016). arXiv:1609.08144

  35. Yamaguti, M., Hata, M.: Weierstrass's function and chaos. Hokkaido Math. J. 12, 333–342 (1983)

  36. Yarotsky, D.: Error bounds for approximations with deep ReLU networks. Neural Netw. 94, 103–114 (2017)

  37. Yarotsky, D.: Quantified advantage of discontinuous weight selection in approximations with deep neural networks (2017). arXiv:1705.01365

  38. Yarotsky, D.: Optimal approximation of continuous functions by very deep ReLU networks. In: Bubeck, S., Perchet, V., Rigollet, P. (eds.) Proceedings of the 31st Conference on Learning Theory, Proceedings of Machine Learning Research, vol. 75, pp. 639–649. PMLR (2018). arXiv:1802.03620


Author information


Correspondence to R. DeVore.

Additional information

Communicated by Wolfgang Dahmen, Ronald A. DeVore, and Philipp Grohs.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research was supported by the NSF Grants DMS 18-17603 (RD-GP), DMS 16-22134 (SF), DMS 16-64803 (SF), DMS 1855684 (BH), Tripods Grant CCF-1934904 (RD, SF, GP), ONR Grants N00014-17-1-2908 (RD), N00014-16-1-2706 (RD), N00014-20-1-2787 (RD, SF, BH, GP), and the Simons Foundation Math + X Investigator Award 400837 (ID).

Appendix

1.1 The Matrices of Lemma 3.3

In order to explicitly write the affine transforms \(A^{(1)}\) and \(A^{(2)}\) that determine the \(\mathrm {ReLU}\) network, we describe here one of the possible ways to partition the set of indices \(\Lambda \) so that the constant sign and separation properties are satisfied. To do this, we first consider \(\Lambda _+\) and only the principal breakpoints \(\xi _{j}\) with indices j for which \(j \bmod 3=\ell \). We collect into the set \(\Lambda _i^{\ell ,+}\) all indices \(k\in \Lambda _+\) that correspond to the i-th hat function \(H_{i,j}\) associated with a principal breakpoint \(\xi _{j}\) with the above-mentioned property. Recall that there are q hat functions \(H_{i,j}\) associated with each principal breakpoint \(\xi _j\). We do this for every \(\ell =0,1,2\), and likewise for \(\Lambda _-\), and we obtain the partition

$$\begin{aligned} \Lambda _i^{\ell ,+}&:= \{s\in \Lambda _+ :\ \phi _s=H_{i,j}\ \text {with}\ j \bmod 3=\ell \},\\ \Lambda _i^{\ell ,-}&:= \{s\in \Lambda _- :\ \phi _s=H_{i,j}\ \text {with}\ j \bmod 3=\ell \}, \end{aligned}$$

where \( \ell =0,1,2\), \(i=1,\ldots ,q\). The matrices that determine the special network are

$$\begin{aligned} M^{(1)}&= \begin{bmatrix}1&1&\ldots&1&0\end{bmatrix}^\top , \quad b^{(1)}=\begin{bmatrix}0&-\xi _1&\ldots&-\xi _{W-2}&0\end{bmatrix}^\top , \\ M^{(2)}&= \begin{bmatrix}1&0&\ldots &0&0 \\ m_{2,1}^{(2)}&m_{2,2}^{(2)}&\ldots &m_{2,W-1}^{(2)}&0 \\ \vdots &\vdots & &\vdots &\vdots \\ m_{W-1,1}^{(2)}&m_{W-1,2}^{(2)}&\ldots &m_{W-1,W-1}^{(2)}&0 \\ 0&0&\ldots &0&1 \end{bmatrix}, \quad b^{(2)}=\begin{bmatrix}0\\ b_2^{(2)}\\ \vdots \\ b_{W-2}^{(2)}\\ 0\end{bmatrix}, \\ M^{(3)}&= \begin{bmatrix}0&\varepsilon ^{(3)}_1&\ldots&\varepsilon ^{(3)}_{W-2}&1\end{bmatrix},\quad b^{(3)}=0, \end{aligned}$$

where \(\varepsilon ^{(3)}_k=1\) if \(\Lambda _k\subset \Lambda _+\), \(\varepsilon ^{(3)}_k=-1\) if \(\Lambda _k\subset \Lambda _-\), and \(\varepsilon ^{(3)}_k=0\) if \(\Lambda _k=\emptyset \). Next, we demonstrate how to find the entries of one row in \(M^{(2)}\). The rest of the rows are computed likewise. The index \(k=1,\ldots ,W-2\) in (10) corresponds to a different labeling of the index set

$$\begin{aligned} \{(i,\ell ,+), (i,\ell ,-),\,i=1,\ldots ,q, \, \ell =0,1,2\}, \end{aligned}$$

of the particular partition we work with here. We take the index \((1,1,+)\) and compute the corresponding \({{\tilde{T}}}\),

$$\begin{aligned} {\tilde{T}}:=T_{(1,1,+)}=\sum _{s\in \Lambda _1^{1,+}}c_s\phi _s=[{\tilde{S}}]_+, \end{aligned}$$

see Fig. 6, where \({{\tilde{S}}}\) is a CPwL function whose breakpoints are the principal breakpoints \(\xi _1,\ldots ,\xi _{W-2}\) and which has the property

$$\begin{aligned} {\tilde{S}}(\xi _{3s+1})=c_{3s+1}, \quad {\tilde{S}}(x_{(3s+1)q-1})= {\tilde{S}}(x_{(3s+1)q+1})=0, \quad s=0,\ldots ,\left\lfloor \frac{W-3}{3}\right\rfloor . \end{aligned}$$

Then, the entries of the second row of \(M^{(2)}\) and of \(b^{(2)}\) are the coefficients in the representation

$$\begin{aligned} \tilde{S}(x)=m_{2,1}^{(2)}x+\sum _{j=2}^{W-2}m_{2,j}^{(2)}(x-\xi _j)_++b_2^{(2)}. \end{aligned}$$
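As a sanity check on this kind of representation, here is a minimal numerical sketch in Python (ours, with generic names rather than the \(m^{(2)}\), \(b^{(2)}\) indexing above): for a CPwL function on [0, 1] with breakpoints \(\xi _1,\ldots ,\xi _K\), the bias is its value at 0, the coefficient of x is the slope of the first linear piece, and the coefficient of \((x-\xi _j)_+\) is the jump of the slope across \(\xi _j\).

```python
import numpy as np

def ramp_coefficients(S, xi, h=1e-6):
    """Recover S(x) = m0*x + sum_j mj*(x - xi_j)_+ + b for a CPwL function S on [0,1]
    with breakpoints xi: b = S(0), m0 = slope of the first piece, mj = jump of the
    slope across xi_j (slopes estimated by central differences at piece midpoints)."""
    pts = np.concatenate(([0.0], np.asarray(xi), [1.0]))
    mids = (pts[:-1] + pts[1:]) / 2
    slopes = np.array([(S(t + h) - S(t - h)) / (2 * h) for t in mids])
    return slopes[0], np.diff(slopes), S(0.0)

# example: a ramp combination with 4 breakpoints, then its coefficients recovered
rng = np.random.default_rng(1)
xi = np.sort(rng.uniform(0.1, 0.9, 4)); m = rng.standard_normal(4)
S = lambda x: 0.5 * x - 0.2 + np.sum(m * np.maximum(x - xi, 0.0))
m0, jumps, b = ramp_coefficients(S, xi)
assert np.allclose([m0, b], [0.5, -0.2]) and np.allclose(jumps, m, atol=1e-6)
```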

1.2 Theorem 3.1, Case \(4\le W\le 7\)

In this case, we have to show that for every \(n\ge 1\) the set \(\Sigma _n\) of free knot linear splines with n breakpoints is contained in the set \(\Upsilon ^{W,L}\) of functions produced by width-W and depth-L \(\mathrm {ReLU}\) networks where

$$\begin{aligned} L= {\left\{ \begin{array}{ll} 2\left\lceil \frac{n}{2(W-2)}\right\rceil , \quad &{} n\ge 2(W-2), \\ 2, &{} n<2(W-2), \end{array}\right. } \end{aligned}$$

and whose number of parameters satisfies

$$\begin{aligned} n(W,L)\le {\left\{ \begin{array}{ll} Cn, \quad &{} n\ge 2(W-2), \\ W^2+4W+1, &{} n< 2(W-2), \end{array}\right. } \end{aligned}$$

where C is an absolute constant. We start with the case \(W-2=2\). Given \(n\ge 4\), we choose \(L:=\lceil \frac{n}{4}\rceil \). If \(n<4L\), we add artificial breakpoints so that we represent \(T\in \Sigma _n\subset \Sigma _{4L}\) as

$$\begin{aligned}&T(x)=ax+b+\sum _{j=1}^{4L}c_j(x-\xi _j)_+=ax+b+\sum _{j=1}^{2L}S_j, \\&S_j:=c_{2j-1}(x-\xi _{2j-1})_++c_{2j}(x-\xi _{2j})_+. \end{aligned}$$

Now we can construct the special network with output \(\overline{{\underline{\Upsilon }}}^{4,2L}\) that generates T via the successive transformations \(A^{(j)}\) given by the matrices

$$\begin{aligned} M^{(1)}=\begin{bmatrix}1&1&1&0\end{bmatrix}^\top , \quad b^{(1)}=\begin{bmatrix}0&-\xi _1&-\xi _2&0\end{bmatrix}^\top , \end{aligned}$$

The jth layer, \(j=2,\ldots ,2L\), produces \(S_{j-1}\) in its CC node via the matrix

$$\begin{aligned} M^{(j)}=\begin{bmatrix}1&0&0&0 \\ 1&0&0&0 \\ 1&0&0&0 \\ 0&c_{2j-3}&c_{2j-2}&1 \end{bmatrix}, \quad b^{(j)}=\begin{bmatrix}0\\ -\xi _{2j-1}\\ -\xi _{2j}\\ 0\end{bmatrix}. \end{aligned}$$

Finally, the output layer is given by the matrix

$$\begin{aligned} M^{(2L+1)}=\begin{bmatrix}a&c_{4L-1}&c_{4L}&1\end{bmatrix},\quad b^{(2L+1)}=b, \end{aligned}$$

where the first entry a and the bias b account for the linear function \(ax+b\) in T. In this case, we have \(\Sigma _{4L}\subset \overline{{\underline{\Upsilon }}}^{4,2L}\subset \Upsilon ^{4,2L}\), with number of parameters

$$\begin{aligned} n(4,2L)=40L-7=40\left\lceil \frac{n}{4}\right\rceil -7<10n+33<19n, \quad n\ge 4. \end{aligned}$$
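The following short sketch (ours, not part of the paper; the function name width4_special_net is ad hoc) assembles the width-4 construction above for a concrete free-knot spline and checks that the network reproduces T. The collation channel is kept ReLU-free, as the special networks of Remark 3.1 allow.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def width4_special_net(x, xi, c, a, b):
    """Sketch of the width-4 special network above: channel 0 carries x, channels 1-2
    carry the two current ramps, channel 3 is the ReLU-free collation channel that
    accumulates the partial sums S_1 + ... + S_{j-1}."""
    assert len(xi) % 4 == 0                       # 4L breakpoints, two per hidden layer
    depth = len(xi) // 2                          # 2L hidden layers
    h = np.array([x, relu(x - xi[0]), relu(x - xi[1]), 0.0])         # layer 1
    for j in range(2, depth + 1):                 # layers j = 2, ..., 2L
        cc = c[2*j - 4] * h[1] + c[2*j - 3] * h[2] + h[3]            # add S_{j-1}
        h = np.array([h[0], relu(h[0] - xi[2*j - 2]), relu(h[0] - xi[2*j - 1]), cc])
    return a * h[0] + c[-2] * h[1] + c[-1] * h[2] + h[3] + b         # output layer

rng = np.random.default_rng(0)
xi = np.sort(rng.uniform(0.0, 1.0, 8)); c = rng.standard_normal(8); a, b = 0.7, -0.3
T = lambda t: a * t + b + np.sum(c * relu(t - xi))                   # free-knot spline, n = 8
xs = np.linspace(0.0, 1.0, 11)
assert np.allclose([width4_special_net(t, xi, c, a, b) for t in xs], [T(t) for t in xs])
```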

For the case \(n<4\), we again add artificial breakpoints so that we represent \(T\in \Sigma _n\subset \Sigma _{4}\) as

$$\begin{aligned}&T(x)=ax+b+\sum _{j=1}^{4}c_j(x-\xi _j)_+=ax+b+\sum _{j=1}^{2}S_j, \\&S_j:=c_{2j-1}(x-\xi _{2j-1})_++c_{2j}(x-\xi _{2j})_+, \end{aligned}$$

and as above construct a special network with output in \(\overline{{\underline{\Upsilon }}}^{4,2}\) for which \(\Sigma _n\subset \overline{{\underline{\Upsilon }}}^{4,2}\), and whose number of parameters is

$$\begin{aligned} n(4,2)=33=W^2+4W+1, \quad W=4. \end{aligned}$$

Now, for the case \((W-2)\in \{3,4,5\}\), let us first consider \(n\ge 2(W-2)\) and take \(L:=\left\lceil \frac{n}{2(W-2)}\right\rceil \). If \(n<2(W-2)L\), we add artificial breakpoints so that we represent \(T\in \Sigma _n\subset \Sigma _{2(W-2)L}\). We do the same construction as in the case \(W-2=2\), by dividing the indices \(\{1,\ldots ,2(W-2)L\}\) into 2L groups of \(W-2\) numbers, as shown in

$$\begin{aligned}&T(x)=ax+b+\sum _{j=1}^{2(W-2)L}c_j(x-\xi _j)_+=ax+b+\sum _{j=1}^{2L}S_j, \\&S_j:=\sum _{i=0}^{W-3}c_{(W-2)j-i}(x-\xi _{(W-2)j-i})_+, \end{aligned}$$

and execute the same construction as before by concatenating the networks producing \(S_j\). In this case, we have \(\Sigma _n\subset \Sigma _{2(W-2)L}\subset \overline{{\underline{\Upsilon }}}^{W,2L}\), and when \(n\ge 2(W-2)\),

$$\begin{aligned} n(W,2L)= & {} 2W(W+1)\left\lceil \frac{n}{2(W-2)}\right\rceil -(W-1)^2+2\\< & {} \frac{W(W+1)}{W-2}n+W^2+4W+1<25n. \end{aligned}$$

When \(n<2(W-2)\), we again add artificial breakpoints so that we represent \(T\in \Sigma _n\subset \Sigma _{2(W-2)}\) as

$$\begin{aligned}&T(x)=ax+b+\sum _{j=1}^{2(W-2)}c_j(x-\xi _j)_+=ax+b+\sum _{j=1}^{2}S_j, \\&S_j:=\sum _{i=0}^{W-3}c_{(W-2)j-i}(x-\xi _{(W-2)j-i})_+, \end{aligned}$$

and as above generate a special network with output in \(\overline{{\underline{\Upsilon }}}^{W,2}\), of depth 2, for which \(\Sigma _n\subset \overline{{\underline{\Upsilon }}}^{W,2}\), and whose number of parameters is

$$\begin{aligned} n(W,2)=2W(W+1)-(W-1)^2+2=W^2+4W+1, \quad n<2(W-2). \end{aligned}$$

This completes the proof. \(\square \)
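For the parameter counts used above, a short check of the bookkeeping (our convention: a width-W network from \(\mathbb {R}\) to \(\mathbb {R}\) with D hidden layers has 2W parameters in the first layer, \(W^2+W\) in each of the remaining \(D-1\) hidden layers, and \(W+1\) in the output layer) reproduces the identities \(n(4,2L)=40L-7\), \(n(W,2)=W^2+4W+1\), and \(n(W,2L)=2W(W+1)L-(W-1)^2+2\).

```python
def n_params(W, D):
    """Parameters of a width-W ReLU network R -> R with D hidden layers:
    2W (first layer) + (D - 1)(W^2 + W) (inner layers) + (W + 1) (output layer)."""
    return 2 * W + (D - 1) * (W * W + W) + (W + 1)

for L in range(1, 8):
    assert n_params(4, 2 * L) == 40 * L - 7
for W in (4, 5, 6, 7):
    assert n_params(W, 2) == W ** 2 + 4 * W + 1
    for L in range(1, 8):
        assert n_params(W, 2 * L) == 2 * W * (W + 1) * L - (W - 1) ** 2 + 2
```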

1.3 Proof of Theorem 4.1

Proof

Note that for every k-tuple \(({{\tilde{S}}}_k,\cdots ,{{\tilde{S}}}_1)\in \Sigma _{n_k}\times \cdots \times \Sigma _{n_1}\), we can find another k-tuple \((S_k,\ldots ,S_1)\in \Sigma _{n_k}\times \cdots \times \Sigma _{n_1}\), which we call a representative of the composition, with the properties:

  • \(S_j([0,1])\subset [0,1]\), \(j=1,\ldots ,k-1\).

  • \({{\tilde{S}}}_k\circ \cdots \circ {{\tilde{S}}}_1=S_k\circ \cdots \circ S_1\).

Indeed, we set \(m_1:=\min _{x\in [0,1]}{{\tilde{S}}}_1(x)\) and \(M_1:=\max _{x\in [0,1]}{{\tilde{S}}}_1(x)\), define inductively

$$\begin{aligned} m_j:=\min _{x\in [m_{j-1},M_{j-1}]}{{\tilde{S}}}_j, \quad M_j:=\max _{x\in [m_{j-1},M_{j-1}]}{{\tilde{S}}}_j, \quad j=2,\ldots ,k-1, \end{aligned}$$

and consider the functions

$$\begin{aligned} S_1&:= \frac{{\tilde{S}}_1-m_1}{M_1-m_1}\in \Sigma _{n_1},\\ S_j(x)&:= \frac{{\tilde{S}}_j(x(M_{j-1}-m_{j-1})+m_{j-1})-m_{j}}{M_j-m_j}\in \Sigma _{n_j}, \quad j=2,\ldots ,k-1,\\ S_k(x)&:= {\tilde{S}}_k(x(M_{k-1}-m_{k-1})+m_{k-1}). \end{aligned}$$

The k-tuple \((S_k,\ldots ,S_1)\) will be a representative of the composition \({{\tilde{S}}}_k\circ \cdots \circ {{\tilde{S}}}_1\). In what follows, we will always assume that we are dealing with representatives of all compositions under consideration and with \(\mathrm {ReLU}\) networks that output these representatives.
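A minimal numerical sketch of this rescaling (ours, not part of the proof; the extrema \(m_j,M_j\) are estimated on a grid, which suffices here because the composition identity holds for any constants with \(M_j>m_j\)):

```python
import numpy as np

def representative(funcs, grid=np.linspace(0.0, 1.0, 2001)):
    """Given CPwL maps S~_1, ..., S~_k on [0,1] (innermost first), return maps
    S_1, ..., S_k with S_j([0,1]) inside [0,1] for j < k and the same composition."""
    reps, lo, hi = [], 0.0, 1.0
    for idx, f in enumerate(funcs):
        vals = f(lo + (hi - lo) * grid)               # values on [m_{j-1}, M_{j-1}]
        m, M = float(vals.min()), float(vals.max())
        a, b = lo, hi
        if idx < len(funcs) - 1:                      # inner maps: rescale into [0,1]
            reps.append(lambda x, f=f, a=a, b=b, m=m, M=M: (f(a + (b - a) * x) - m) / (M - m))
            lo, hi = m, M
        else:                                         # outermost map: only rescale the argument
            reps.append(lambda x, f=f, a=a, b=b: f(a + (b - a) * x))
    return reps

S1 = lambda x: np.abs(2 * x - 1)                      # toy CPwL maps
S2 = lambda x: 3 * np.minimum(x, 1 - x) - 0.5
S3 = lambda x: 2 * x - 0.7
reps = representative([S1, S2, S3])
xs = np.linspace(0.0, 1.0, 9)
assert np.allclose(S3(S2(S1(xs))), reps[2](reps[1](reps[0](xs))))
```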

Relation (11) follows from Proposition 4.2 and Theorem 3.1. Indeed, if we fix an element in \(\Sigma ^{n_k\circ \cdots \circ n_1}:=\{\tilde{S}_k\circ \cdots \circ {{\tilde{S}}}_1:\,{{\tilde{S}}}_j\in \Sigma _{n_j}, j=1,\ldots ,k\}\) and consider its representative \((S_k,\ldots ,S_1)\), each \(S_j\) in the composition \(S_k\circ \cdots \circ S_1\) can be produced by a \(\mathrm {ReLU}\) network with width W and depth

$$\begin{aligned} L_j=2\left\lceil \frac{n_j}{\lfloor \frac{W-2}{6}\rfloor (W-2)}\right\rceil , \end{aligned}$$

and therefore, part (i) of Proposition 4.2 ensures that \(S_k\circ \cdots \circ S_1\in \Upsilon ^{W,\sum _{j=1}^kL_j}\). A similar estimate as in the proof of Theorem 3.1 yields

$$\begin{aligned} n(W,L) < 34\sum _{j=1}^kn_j+2k(W^2+W), \end{aligned}$$

as desired.

To establish (12), for each \(i=1,\ldots ,m\), let us denote by \({{{\mathcal {N}}}}_i\) the \(\mathrm {ReLU}\) network from (11) with width \(W-2\) that produces the composition \(S_{i,\ell _i}\circ \cdots \circ S_{i,1}\) and has depth

$$\begin{aligned} L_i=L(n_{i,\ell _i},\ldots ,n_{i,1})=2\sum _{j=1}^{\ell _i}\left\lceil \frac{n_{i,j}}{\lfloor \frac{W-4}{6}\rfloor (W-4)}\right\rceil . \end{aligned}$$

Then, Proposition 4.2, part (ii) gives

$$\begin{aligned} S=\sum _{i=1}^ma_iS_{i,\ell _i}\circ \cdots \circ S_{i,1} \in \overline{{\underline{\Upsilon }}}^{W,L}, \end{aligned}$$

with

$$\begin{aligned} L=\sum _{i=1}^mL_i=2\sum _{i=1}^m\sum _{j=1}^{\ell _i}\left\lceil \frac{n_{i,j}}{\lfloor \frac{W-4}{6}\rfloor (W-4)}\right\rceil . \end{aligned}$$

A similar estimate as in the proof of Theorem 3.1 yields

$$\begin{aligned} n(W,L) < 44\sum _{i=1}^m\sum _{j=1}^{\ell _i}n_{i,j}+2W(W+1)\sum _{i=1}^m\ell _i. \end{aligned}$$

As discussed in Remark 3.1, \(\overline{{\underline{\Upsilon }}}^{W,L}\) can always be viewed as a subset of \(\Upsilon ^{W,L}\), and the proof is completed. \(\square \)

1.4 Proof of Proposition 6.1

Let us first introduce the notation

$$\begin{aligned} {\mathbf{1}}_{\{i=j \}}:= {\left\{ \begin{array}{ll} 1, \quad &{} i=j, \\ 0, &{} i\ne j, \end{array}\right. } \end{aligned}$$

and isolate the following technical observation.

Lemma 9.1

For any nonnegative sequence \(u \in \ell _2({\mathbb {N}})\),

$$\begin{aligned} \sum _{\begin{array}{c} k,\ell \ge 1\\ k \not = \ell \end{array} } u_k u_\ell \sum _{m,n\ge 0} \frac{1}{(2m+1)^2} \frac{1}{(2n+1)^2} {\mathbf{1}}_{\{(2m+1)k = (2n+1)\ell \}} \le \frac{\pi ^4}{192} \Vert u\Vert _2^2. \end{aligned}$$
(30)

Proof

For each integer \(m \ge 0\), let us introduce the sequence \(u^{(2m+1)} \in \ell _2({\mathbb {N}})\) defined by

$$\begin{aligned} u^{(2m+1)}_j = \left\{ \begin{matrix} u_{\frac{j}{2m+1}}, &{} \text{ if } j \in (2m+1){\mathbb {N}},\\ 0, &{} \text{ if } j \not \in (2m+1){\mathbb {N}}, \end{matrix} \right. \end{aligned}$$

i.e., we consider a new sequence obtained from the original one by inserting 2m zeros between any two consecutive terms and placing 2m zeros at the start. We easily see that

$$\begin{aligned} \langle u^{(2m+1)}, u^{(2n+1)} \rangle= & {} \sum _{j \in {\mathbb {N}}} u^{(2m+1)}_j u^{(2n+1)}_j =\sum _{k,\ell \in {\mathbb {N}}} u_k u_\ell {\mathbf{1}}_{\{ (2m+1)k = (2n+1)\ell \}}, \end{aligned}$$

and in particular \(\Vert u^{(2m+1)}\Vert _2^2=\Vert u\Vert _2^2\) for every \(m \ge 0\). Thus, the left-hand side of (30), which we denote by \(\Sigma \), can be written as

$$\begin{aligned} \Sigma= & {} \sum _{\begin{array}{c} m,n\ge 0\\ m \not = n \end{array}} \frac{1}{(2m+1)^2} \frac{1}{(2n+1)^2} \sum _{k,\ell \ge 1 } u_k u_\ell {\mathbf{1}}_{\{(2m+1)k = (2n+1)\ell \}}\nonumber \\= & {} \sum _{\begin{array}{c} m,n\ge 0\\ m \not = n \end{array}} \frac{1}{(2m+1)^2} \frac{1}{(2n+1)^2} \langle u^{(2m+1)}, u^{(2n+1)} \rangle \nonumber \\= & {} \bigg \Vert \sum _{m \ge 0} \frac{1}{(2m+1)^2} u^{(2m+1)} \bigg \Vert _2^2 - \sum _{m \ge 0} \frac{1}{(2m+1)^4} \Vert u\Vert _2^2. \end{aligned}$$
(31)

By a simple triangle inequality, we have (see [32], Chapter 2, Exercise 6)

$$\begin{aligned} \bigg \Vert \sum _{m \ge 0} \frac{1}{(2m+1)^2} u^{(2m+1)} \bigg \Vert _2 \le \sum _{m \ge 0} \frac{1}{(2m+1)^2} \Vert u\Vert _2 = \frac{\pi ^2}{8} \Vert u\Vert _2. \end{aligned}$$
(32)

Moreover, it is well known (see [32], Chapter 3, Exercise 8(a)) that

$$\begin{aligned} \sum _{m \ge 0} \frac{1}{(2m+1)^4} = \frac{\pi ^4}{96}. \end{aligned}$$
(33)

Substituting (32) and (33) into (31) yields the announced result. \(\square \)
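Before turning to the proposition itself, here is a finite-section illustration of Lemma 9.1 (ours, not part of the proof): for a nonnegative sequence supported on \(\{1,\ldots ,N\}\), with the sum in (30) truncated to \(2m+1,2n+1\le N\), all terms are nonnegative, so the truncated sum stays below \(\frac{\pi ^4}{192}\Vert u\Vert _2^2\).

```python
import numpy as np

N = 40
rng = np.random.default_rng(2)
u = np.abs(rng.standard_normal(N + 1)); u[0] = 0.0        # u_k for k = 1, ..., N
odd = range(1, N + 1, 2)                                   # candidate values of 2m+1, 2n+1
total = 0.0
for k in range(1, N + 1):
    for l in range(1, N + 1):
        if k == l:
            continue
        total += u[k] * u[l] * sum(1.0 / (p * p * q * q)
                                   for p in odd for q in odd if p * k == q * l)
bound = np.pi ** 4 / 192 * np.sum(u[1:] ** 2)
assert total <= bound                                      # truncation only drops nonnegative terms
print(total, bound)
```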

Proof of Proposition 6.1

We equivalently prove the result for the \(L_2\)-normalized version of the system \(({\mathcal {C}}_k,{\mathcal {S}}_k)_{k \ge 1}\), i.e., for \((\widetilde{{\mathcal {C}}}_k := \sqrt{3} \ {\mathcal {C}}_k,\widetilde{{\mathcal {S}}}_k := \sqrt{3} \ {\mathcal {S}}_k)_{k \ge 1}\). Let \((c_k,s_k)_{k \ge 1}\) denote the orthonormal basis for \(L_2^0[0,1]\) made of the usual trigonometric functions

$$\begin{aligned} c_k(x) = \sqrt{2} \cos (2 \pi k x), \qquad s_k(x) = \sqrt{2} \sin (2 \pi k x), \qquad x \in [0,1]. \end{aligned}$$

It is routine to verify (by computing Fourier series) that

$$\begin{aligned} {{{\mathcal {C}}}} = \lambda \sum _{m \ge 0} \frac{1}{(2m+1)^2} c_{2m+1}, \qquad {{{\mathcal {S}}}} = \lambda \sum _{m \ge 0} \frac{(-1)^m}{(2m+1)^2} s_{2m+1}, \end{aligned}$$

for some constant \(\lambda >0\), from which one immediately obtains that, for any \(k \ge 1\),

$$\begin{aligned} \widetilde{{{\mathcal {C}}}}_k = \mu \sum _{m \ge 0} \frac{1}{(2m+1)^2} c_{(2m+1)k}, \qquad \widetilde{{{\mathcal {S}}}}_k = \mu \sum _{m \ge 0} \frac{(-1)^m}{(2m+1)^2} s_{(2m+1)k}, \end{aligned}$$

for some constant \(\mu >0\). Notice that this implies \(\widetilde{{{\mathcal {C}}}}_k \perp s_\ell \), \(\widetilde{{{\mathcal {S}}}}_k \perp c_\ell \), and \(\widetilde{{{\mathcal {C}}}}_k \perp \widetilde{{{\mathcal {S}}}}_\ell \) for all \(k,\ell \ge 1\). Moreover, the normalization \(\Vert \widetilde{{{\mathcal {C}}}}_k\Vert _{L_2[0,1]} = \Vert \widetilde{\mathcal{S}}_k\Vert _{L_2[0,1]} = 1\) gives

$$\begin{aligned} \mu ^2 \sum _{m \ge 0} \frac{1}{(2m+1)^4} = 1, \qquad \text{ i.e., } \qquad \mu ^2 \frac{\pi ^4}{96}=1. \end{aligned}$$
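The constants entering here are easy to check numerically (a simple illustration, not needed for the proof): \(\sum _{m\ge 0}(2m+1)^{-2}=\pi ^2/8\), \(\sum _{m\ge 0}(2m+1)^{-4}=\pi ^4/96\), hence \(\mu =\sqrt{96}/\pi ^2\approx 0.9927\) and the constant \(\rho =1-\mu ^2+1/2\approx 0.5145\) appearing at the end of the proof.

```python
import numpy as np

odd = np.arange(1.0, 2_000_001.0, 2.0)                     # odd integers 1, 3, 5, ...
assert abs(np.sum(1 / odd ** 2) - np.pi ** 2 / 8) < 1e-6   # sum over odd j of j^{-2}
assert abs(np.sum(1 / odd ** 4) - np.pi ** 4 / 96) < 1e-12  # sum over odd j of j^{-4}
mu = 1 / np.sqrt(np.sum(1 / odd ** 4))
print(mu, np.sqrt(96) / np.pi ** 2)                        # both about 0.9927
print(1 - mu ** 2 + 0.5)                                   # about 0.5145, the rho found below
```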

Let us introduce operators \(T_{\mathcal {C}}, T_{{\mathcal {S}}}\) defined for \(v \in \ell _2({\mathbb {N}})\) and \(j \in {\mathbb {N}}\), by

$$\begin{aligned} T_{{\mathcal {C}}}(v)_j= & {} \sum _{k \ge 1} v_k \langle \widetilde{{\mathcal {C}}}_k,c_j \rangle = \mu \sum _{k \ge 1} v_k \sum _{m \ge 0} \frac{1}{(2m+1)^2} {\mathbf{1}}_{ \{ (2m+1)k = j \} }, \\ T_{{\mathcal {S}}}(v)_j= & {} \sum _{k \ge 1} v_k \langle \widetilde{{\mathcal {S}}}_k,s_j \rangle = \mu \sum _{k \ge 1} v_k \sum _{m \ge 0} \frac{(-1)^m}{(2m+1)^2} {\mathbf{1}}_{ \{ (2m+1)k = j \} }, \end{aligned}$$

and let us first verify that these are well-defined operators from \(\ell _2({\mathbb {N}})\) to \(\ell _2({\mathbb {N}})\), i.e., that both \(\Vert T_{{\mathcal {C}}}v\Vert _2\) and \(\Vert T_{{\mathcal {S}}}v\Vert _2\) are finite when \(v \in \ell _2({\mathbb {N}})\). To do so, we observe that

$$\begin{aligned} \Vert T_{{\mathcal {C}}} v\Vert _2^2= & {} \mu ^2 \sum _{j \ge 1} \sum _{k,\ell \ge 1} v_k v_\ell \sum _{m,n \ge 0} \frac{1}{(2m+1)^2} \frac{1}{(2n+1)^2} {\mathbf{1}}_{ \{ (2m+1)k = j \} } {\mathbf{1}}_{ \{ (2n+1)\ell = j \} }\\= & {} \Sigma _{(=)} + \Sigma _{(\not =)}, \end{aligned}$$

where \(\Sigma _{(=)}\) represents the contribution to the sum when k and \(\ell \) are equal and \(\Sigma _{(\not =)}\) represents the contribution to the sum when k and \(\ell \) are distinct. We notice that

$$\begin{aligned} \Sigma _{(=)} = \sum _{k\ge 1} v_k^2 \, \mu ^2 \sum _{m\ge 0} \frac{1}{(2m+1)^4} \sum _{j \ge 1} {\mathbf{1}}_{ \{ (2m+1)k = j \} } = \sum _{k\ge 1} v_k^2 \, \mu ^2 \sum _{m\ge 0} \frac{1}{(2m+1)^4} =\sum _{k\ge 1} v_k^2. \end{aligned}$$

Therefore, relying on Lemma 9.1, we obtain

$$\begin{aligned}&\left| \Vert T_{{\mathcal {C}}} v\Vert _2^2 - \Vert v\Vert _2^2 \right| \nonumber \\&\quad = \left| \Sigma _{(\not =)} \right| \le \mu ^2 \sum _{\begin{array}{c} k,\ell \ge 1\\ k \not = \ell \end{array}} |v_k| |v_\ell | \sum _{m,n \ge 0} \frac{1}{(2m+1)^2} \frac{1}{(2n+1)^2} \sum _{j \ge 1} {\mathbf{1}}_{ \{ (2m+1)k = j \} } {\mathbf{1}}_{ \{ (2n+1)\ell = j \} }\nonumber \\&\quad = \mu ^2 \sum _{\begin{array}{c} k,\ell \ge 1\\ k \not = \ell \end{array}} |v_k| |v_\ell | \sum _{m,n \ge 0} \frac{1}{(2m+1)^2} \frac{1}{(2n+1)^2} {\mathbf{1}}_{ \{ (2m+1)k = (2n+1)\ell \} } \nonumber \\&\quad \le \mu ^2 \frac{\pi ^4}{192} \Vert v\Vert _2^2 = \frac{1}{2} \Vert v\Vert _2^2. \end{aligned}$$
(34)

This clearly justifies that \(\Vert T_{{\mathcal {C}}} v\Vert _2^2 < \infty \), and \(\Vert T_{{\mathcal {S}}} v\Vert _2^2 < \infty \) is verified in a similar fashion. In fact, the inequality (34) and the analogous one for \(T_{{\mathcal {S}}}\) show that

$$\begin{aligned} \Vert T_{{\mathcal {C}}}^* T_{{\mathcal {C}}} - I \Vert _{2 \rightarrow 2} = \max _{\Vert v\Vert _2 = 1} | \langle v, (T_{{\mathcal {C}}}^* T_{{\mathcal {C}}} - I) v \rangle | \le \frac{1}{2}, \qquad \Vert T_{{\mathcal {S}}}^* T_{{\mathcal {S}}} - I \Vert _{2 \rightarrow 2} \le \frac{1}{2}. \end{aligned}$$
(35)
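The entries \(\langle \widetilde{{\mathcal {C}}}_k,c_j \rangle \) are explicit: they equal \(\mu (k/j)^2\) when k divides j with j/k odd, and 0 otherwise. The following finite-section sketch (ours; the truncation introduces a small error, so this is an illustration of (35), not a verification) builds this matrix and computes \(\Vert T_{{\mathcal {C}}}^* T_{{\mathcal {C}}}-I\Vert \) numerically, to be compared with the bound 1/2.

```python
import numpy as np

N, M = 200, 200 * 51                          # columns k <= N, rows j <= M >> N
mu = np.sqrt(96) / np.pi ** 2
T = np.zeros((M, N))
for k in range(1, N + 1):
    for j in range(k, M + 1, 2 * k):          # j = (2m+1) k, m = 0, 1, 2, ...
        T[j - 1, k - 1] = mu * (k / j) ** 2   # entry <C~_k, c_j> = mu / (2m+1)^2
E = T.T @ T - np.eye(N)
print(np.linalg.norm(E, 2))                   # compare with the bound 1/2 in (35)
```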

The bound (35) ensures that the operators \(T_{{\mathcal {C}}}^* T_{{\mathcal {C}}}\) and \(T_{{\mathcal {S}}}^* T_{{\mathcal {S}}}\) are invertible. Let us assume for the moment that the operators \(T_{{\mathcal {C}}} T_{{\mathcal {C}}}^*\) and \(T_{{\mathcal {S}}} T_{{\mathcal {S}}}^*\) are also invertible. Then we derive that \(T_{{\mathcal {C}}}\) is invertible with inverse \((T_{{\mathcal {C}}}^* T_{{\mathcal {C}}})^{-1} T_{{\mathcal {C}}}^*\), since \((T_{{\mathcal {C}}}^* T_{{\mathcal {C}}})^{-1} T_{{\mathcal {C}}}^* T_{{\mathcal {C}}} = I \) is obvious and \(T_{{\mathcal {C}}} (T_{\mathcal C}^* T_{{\mathcal {C}}})^{-1} T_{{\mathcal {C}}}^* = I\) is equivalent, by the invertibility of \(T_{{\mathcal {C}}} T_{{\mathcal {C}}}^*\), to \(T_{{\mathcal {C}}} T_{{\mathcal {C}}}^* T_{{\mathcal {C}}} (T_{{\mathcal {C}}}^* T_{{\mathcal {C}}})^{-1} T_{{\mathcal {C}}}^* = T_{{\mathcal {C}}} T_{\mathcal C}^*\), which is obvious. We derive that \(T_{{\mathcal {S}}}\) is invertible in a similar fashion. From here, we can show that the system \((\widetilde{{\mathcal {C}}}_k, \widetilde{{\mathcal {S}}}_k)_{k \ge 1}\) spans \(L_{2}^0[0,1]\). Indeed, we claim that any \(f \in L_{2}^0[0,1]\) can be written, with \(\alpha := (\langle f, c_j \rangle )_{j \ge 1}\) and \(\beta := (\langle f, s_j \rangle )_{j \ge 1}\), as

$$\begin{aligned} f = \sum _{k \ge 1} (T_{{\mathcal {C}}}^{-1} \alpha )_k \widetilde{{\mathcal {C}}}_k + \sum _{k \ge 1} (T_{{\mathcal {S}}}^{-1} \beta )_k \widetilde{{\mathcal {S}}}_k. \end{aligned}$$

This identity is verified by taking the inner product of partial sums with the \(c_j\) and \(s_j\). Indeed,

$$\begin{aligned} \left\langle f - \sum _{k=1}^K (T_{{\mathcal {C}}}^{-1} \alpha )_k \widetilde{{\mathcal {C}}}_k - \sum _{k=1}^K (T_{{\mathcal {S}}}^{-1} \beta )_k \widetilde{{\mathcal {S}}}_k , c_j \right\rangle= & {} \langle f, c_j \rangle - \sum _{k=1}^K (T_{{\mathcal {C}}}^{-1} \alpha )_k \langle \widetilde{{\mathcal {C}}}_k, c_j \rangle - 0\\= & {} \left( T_{{\mathcal {C}}} (T_{{\mathcal {C}}}^{-1} \alpha )\right) _j - \left( T_{{\mathcal {C}}} (T_{{\mathcal {C}}}^{-1} \alpha )_{\{1,\ldots ,K\}}\right) _j\\= & {} \left( T_{{\mathcal {C}}} (T_{{\mathcal {C}}}^{-1} \alpha )_{\{K+1,\ldots \}}\right) _j. \end{aligned}$$

After a similar calculation with \(s_j\), and in view of \(\Vert T_{{\mathcal {C}}} \Vert _{2 \rightarrow 2}^2 = \Vert T_{{\mathcal {C}}}^* T_{{\mathcal {C}}}\Vert _{2 \rightarrow 2} \le 3/2\), it follows that

$$\begin{aligned}&\left\| f - \sum _{k=1}^K (T_{{\mathcal {C}}}^{-1} \alpha )_k \widetilde{{\mathcal {C}}}_k - \sum _{k=1}^K (T_{{\mathcal {S}}}^{-1} \beta )_k \widetilde{{\mathcal {S}}}_k \right\| _{L_2[0,1]}^2 \\&\quad \le \left\| T_{{\mathcal {C}}} (T_{{\mathcal {C}}}^{-1} \alpha )_{\{K+1,\ldots \}} \right\| _2^2 + \left\| T_{{\mathcal {S}}} (T_{{\mathcal {S}}}^{-1} \beta )_{\{K+1,\ldots \}} \right\| _2^2\\&\quad \le \frac{3}{2} \left( \left\| (T_{{\mathcal {C}}}^{-1} \alpha )_{\{K+1,\ldots \}} \right\| _2^2 + \left\| (T_{{\mathcal {S}}}^{-1} \beta )_{\{K+1,\ldots \}} \right\| _2^2 \right) \\&\quad \underset{K \rightarrow \infty }{\longrightarrow } 0, \end{aligned}$$

which confirms our claim. As for a normalized version of (17), it follows from (35) by noticing that

$$\begin{aligned} \bigg \Vert \sum _{k\ge 1} (a_k \widetilde{{{\mathcal {C}}}}_k + b_k \widetilde{{{\mathcal {S}}}}_k) \bigg \Vert _{L_2[0,1]}^2 - (\Vert a \Vert _2^2 + \Vert b \Vert _2^2)= & {} \bigg \Vert \sum _{k\ge 1} a_k \widetilde{{{\mathcal {C}}}}_k \bigg \Vert _{L_2[0,1]}^2 - \Vert a \Vert _2^2 \\&+ \bigg \Vert \sum _{k\ge 1} b_k \widetilde{{{\mathcal {S}}}}_k \bigg \Vert _{L_2[0,1]}^2 - \Vert b \Vert _2^2, \end{aligned}$$

combined with the observation that

$$\begin{aligned} \bigg | \bigg \Vert \sum _{k\ge 1} a_k \widetilde{{{\mathcal {C}}}}_k \bigg \Vert _{L_2[0,1]}^2 - \Vert a \Vert _2^2 \bigg |= & {} \bigg | \sum _{j \ge 1} \bigg ( \sum _{k \ge 1} a_k \left\langle \widetilde{{\mathcal {C}}}_k , c_j \right\rangle \bigg )^2 - \Vert a\Vert _2^2 \bigg | = \bigg | \sum _{j \ge 1} (T_{{\mathcal {C}}}a)_j^2 - \Vert a\Vert _2^2 \bigg |\\= & {} \big | \Vert T_{{\mathcal {C}}} a\Vert _2^2 - \Vert a\Vert _2^2 \big | = \big | \langle (T_{{\mathcal {C}}}^* T_{{\mathcal {C}}} -I) a, a \rangle \big | \le \frac{1}{2} \Vert a\Vert _2^2, \end{aligned}$$

and the similar observation that

$$\begin{aligned} \bigg | \bigg \Vert \sum _{k\ge 1} b_k \widetilde{{{\mathcal {S}}}}_k \bigg \Vert _{L_2[0,1]}^2 - \Vert b \Vert _2^2 \bigg | \le \frac{1}{2} \Vert b\Vert _2^2. \end{aligned}$$

We deduce that a normalized version of (17) holds with constants \({\widetilde{c}}=1/2\) and \({\widetilde{C}}=3/2\), hence (17) holds with \(c=1/6\) and \(C=1/2\).

It now remains to establish that the operators \(T_{{\mathcal {C}}} T_{{\mathcal {C}}}^*\) and \(T_{{\mathcal {S}}} T_{{\mathcal {S}}}^*\) are invertible, which we do by showing that

$$\begin{aligned} \Vert T_{{\mathcal {C}}} T_{{\mathcal {C}}}^* - I \Vert _{2 \rightarrow 2} \le \rho \qquad \text{ and } \qquad \Vert T_{{\mathcal {S}}} T_{{\mathcal {S}}}^* - I \Vert _{2 \rightarrow 2} \le \rho \end{aligned}$$
(36)

for some constant \(\rho < 1\). We concentrate on the case of \(T_{{\mathcal {C}}}\), as the case of \(T_{{\mathcal {S}}}\) is handled similarly. We first remark that the adjoint of \(T_{{\mathcal {C}}}\) is given, for any \(v \in \ell _2({\mathbb {N}})\) and \(j \in {\mathbb {N}}\), by

$$\begin{aligned} T_{{\mathcal {C}}}^*(v)_j = \sum _{k \ge 1} v_k \langle \widetilde{{\mathcal {C}}}_j,c_k \rangle = \mu \sum _{k \ge 1} v_k \sum _{m \ge 0} \frac{1}{(2m+1)^2} {\mathbf{1}}_{ \{ (2m+1)j = k \} }. \end{aligned}$$

We then compute

$$\begin{aligned} \Vert T_{{\mathcal {C}}}^* v\Vert _2^2= & {} \mu ^2 \sum _{j \ge 1} \sum _{k,\ell \ge 1} v_k v_\ell \sum _{m,n \ge 0} \frac{1}{(2m+1)^2} \frac{1}{(2n+1)^2} {\mathbf{1}}_{ \{ (2m+1)j = k \} } {\mathbf{1}}_{ \{ (2n+1)j = \ell \} }\\= & {} \Sigma _{(=)}^* + \Sigma _{(\not =)}^*, \end{aligned}$$

where \(\Sigma _{(=)}^*\) represents the contribution to the sum when k and \(\ell \) are equal and \(\Sigma _{(\not =)}^*\) represents the contribution to the sum when k and \(\ell \) are distinct. We notice that

$$\begin{aligned} \Sigma _{(=)}^* = \sum _{k \ge 1} v_k^2 \, \mu ^2 \sum _{m \ge 0} \frac{1}{(2m+1)^4} \sum _{j \ge 1} {\mathbf{1}}_{ \{ (2m+1)j = k \} } \end{aligned}$$

satisfies, on the one hand,

$$\begin{aligned} \Sigma _{(=)}^* \le \sum _{k \ge 1} v_k^2 \, \mu ^2 \sum _{m \ge 0} \frac{1}{(2m+1)^4} = \sum _{k \ge 1} v_k^2 = \Vert v\Vert _2^2, \end{aligned}$$

and on the other hand, by considering only the summand for \(m=0\) and \(j=k\),

$$\begin{aligned} \Sigma _{(=)}^* \ge \sum _{k \ge 1} v_k^2 \, \mu ^2 = \mu ^2 \Vert v\Vert _2^2. \end{aligned}$$

Moreover, we have

$$\begin{aligned} \left| \Sigma _{(\not =)}^{ *} \right|\le & {} \mu ^2 \sum _{\begin{array}{c} k,\ell \ge 1\\ k \not = \ell \end{array}} |v_k| |v_\ell | \sum _{m,n \ge 0} \frac{1}{(2m+1)^2} \frac{1}{(2n+1)^2} \sum _{j \ge 1} {\mathbf{1}}_{ \{ (2m+1)j = k \} } {\mathbf{1}}_{ \{ (2n+1)j = \ell \} }\nonumber \\\le & {} \mu ^2 \sum _{\begin{array}{c} k,\ell \ge 1\\ k \not = \ell \end{array}} |v_k| |v_\ell | \sum _{m,n \ge 0} \frac{1}{(2m+1)^2} \frac{1}{(2n+1)^2} {\mathbf{1}}_{ \{ (2m+1)\ell = (2n+1)k \} } \nonumber \\\le & {} \mu ^2 \frac{\pi ^4}{192} \Vert v\Vert _2^2 = \frac{1}{2} \Vert v\Vert _2^2, \end{aligned}$$
(37)

where the last inequality used Lemma 9.1 again. Therefore, we obtain

$$\begin{aligned} \left| \langle (T_{{\mathcal {C}}} T_{{\mathcal {C}}}^* - I)v, v \rangle \right| = \left| \Vert T_{{\mathcal {C}}}^*v\Vert _2^2 - \Vert v\Vert _2^2 \right| = \left| (\Sigma _{(=)}^* - \Vert v\Vert _2^2) + \Sigma _{(\not =)}^* \right| \le (1-\mu ^2) \Vert v\Vert _2^2 + \frac{1}{2} \Vert v\Vert _2^2. \end{aligned}$$

Taking the maximum over all \(v \in \ell _2({\mathbb {N}})\) with \(\Vert v\Vert _2=1\), we arrive at the result announced in (36) with \(\rho := 1-\mu ^2 + 1/2 \le 0.5145\). The proof is now complete. \(\square \)


Cite this article

Daubechies, I., DeVore, R., Foucart, S. et al. Nonlinear Approximation and (Deep) \(\mathrm {ReLU}\) Networks. Constr Approx 55, 127–172 (2022). https://doi.org/10.1007/s00365-021-09548-z
