
Unregularized Online Algorithms with Varying Gaussians

Constructive Approximation 53, 403–440 (2021)

Abstract

Gaussians form a family of Mercer kernels widely used in machine learning and statistics. The variance of a Gaussian kernel reflects the specific structure of the reproducing kernel Hilbert space (RKHS) it induces, as well as other important features of the learning problem such as the frequency of function components. As the variance of the Gaussian decreases, the approximation ability, and hence the learning performance, improves. This paper studies unregularized online algorithms with decreasing Gaussian variances, in which no regularization term is imposed and the samples are presented in sequence. With appropriate step sizes, concrete learning rates are derived under smoothness assumptions on the target function, which are used to bound the approximation error. Additionally, a new type of geometric noise condition is proposed to estimate the approximation error in place of any smoothness assumption; it is more general than the condition in Steinwart et al. (Ann Stat 35(2):575–607, 2007), which is suitable only for the hinge loss. An essential estimate bounds the difference of the approximation functions generated by varying Gaussian RKHSs. The Fourier transform plays a crucial role in our analysis.
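To fix ideas, the following is a minimal Python sketch of one plausible form of such an unregularized online scheme: stochastic gradient descent for least squares in Gaussian RKHSs whose bandwidths shrink polynomially. The schedules \(\sigma _t=\sigma _1t^{-\beta }\) and \(\eta _t=\eta _1t^{-\theta }\), the squared loss, and all function names are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def gaussian_kernel(x, u, sigma):
    """Gaussian kernel K_sigma(x, u) = exp(-|x - u|^2 / (2 sigma^2))."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(u)) ** 2) / (2.0 * sigma ** 2))

def online_gaussian_sgd(stream, beta=0.25, theta=0.75, sigma1=1.0, eta1=0.5):
    """Unregularized online least-squares learning with shrinking Gaussian bandwidths.

    Assumed (illustrative) schedules:
        sigma_t = sigma1 * t**(-beta)   # decreasing kernel variance
        eta_t   = eta1   * t**(-theta)  # decreasing step size
    The iterate is stored through its kernel expansion
        f_{t+1} = sum_{i <= t} a_i K_{sigma_i}(x_i, .),
    updated by f_{t+1} = f_t - eta_t (f_t(x_t) - y_t) K_{sigma_t}(x_t, .).
    """
    xs, sigmas, coefs = [], [], []

    def predict(x):
        return sum(a * gaussian_kernel(x, xi, s)
                   for a, xi, s in zip(coefs, xs, sigmas))

    for t, (x_t, y_t) in enumerate(stream, start=1):
        sigma_t = sigma1 * t ** (-beta)
        eta_t = eta1 * t ** (-theta)
        residual = predict(x_t) - y_t   # gradient of the pointwise squared loss at f_t
        xs.append(x_t)
        sigmas.append(sigma_t)
        coefs.append(-eta_t * residual)
    return predict

# Toy usage: learn y = sin(pi x) from 500 noisy samples on [-1, 1].
rng = np.random.default_rng(0)
stream = ((x, np.sin(np.pi * x) + 0.1 * rng.standard_normal())
          for x in rng.uniform(-1.0, 1.0, size=500))
f_hat = online_gaussian_sgd(stream)
print(f_hat(0.3), np.sin(np.pi * 0.3))
```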


References

  1. Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68(3), 337–404 (1950)

  2. Cesa-Bianchi, N., Conconi, A., Gentile, C.: On the generalization ability of on-line learning algorithms. IEEE Trans. Inf. Theory 50(9), 2050–2057 (2004)

  3. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for support vector machines. Mach. Learn. 46(1–3), 131–159 (2002)

  4. Chen, D.R., Wu, Q., Ying, Y., Zhou, D.-X.: Support vector machine soft margin classifiers: error analysis. J. Mach. Learn. Res. 5, 1143–1175 (2004)

  5. Dieuleveut, A., Bach, F., et al.: Nonparametric stochastic approximation with large step-sizes. Ann. Stat. 44(4), 1363–1399 (2016)

  6. Eberts, M., Steinwart, I.: Optimal learning rates for least squares SVMs using Gaussian kernels. In: Advances in Neural Information Processing Systems, pp. 1539–1547 (2011)

  7. Evgeniou, T., Pontil, M.: Regularized multi-task learning. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 109–117. ACM (2004)

  8. Hang, H., Steinwart, I.: Optimal learning with anisotropic Gaussian SVMs. arXiv preprint arXiv:1810.02321 (2018)

  9. Lanckriet, G.R., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res. 5(Jan), 27–72 (2004)

  10. Lei, Y., Shi, L., Guo, Z.-C.: Convergence of unregularized online learning algorithms. J. Mach. Learn. Res. 18(1), 6269–6301 (2017)

  11. Lin, J., Zhou, D.-X.: Online learning algorithms can converge comparably fast as batch learning. IEEE Trans. Neural Netw. Learn. Syst. 29(6), 2367–2378 (2018)

  12. Matache, M.T., Matache, V.: Hilbert spaces induced by Toeplitz covariance kernels. In: Pasik-Duncan, B. (ed.) Stochastic Theory and Control, pp. 319–333. Springer, Berlin (2002)

  13. Rakhlin, A., Panchenko, D., Mukherjee, S.: Risk bounds for mixture density estimation. ESAIM Probab. Stat. 9, 220–229 (2005)

  14. Smale, S., Yao, Y.: Online learning algorithms. Found. Comput. Math. 6(2), 145–170 (2006)

  15. Smale, S., Zhou, D.-X.: Estimating the approximation error in learning theory. Anal. Appl. 1(01), 17–41 (2003)

  16. Steinwart, I., Hush, D., Scovel, C.: An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels. IEEE Trans. Inf. Theory 52(10), 4635–4643 (2006)

  17. Steinwart, I., Scovel, C., et al.: Fast rates for support vector machines using Gaussian kernels. Ann. Stat. 35(2), 575–607 (2007)

  18. Tsybakov, A.B., et al.: Optimal aggregation of classifiers in statistical learning. Ann. Stat. 32(1), 135–166 (2004)

  19. Wu, Q., Ying, Y., Zhou, D.-X.: Multi-kernel regularized classifiers. J. Complex. 23(1), 108–134 (2007)

  20. Xiang, D.-H., Zhou, D.-X.: Classification with Gaussians and convex loss. J. Mach. Learn. Res. 10(Jul), 1447–1468 (2009)

  21. Yang, Y.: Minimax nonparametric classification. I. Rates of convergence. IEEE Trans. Inf. Theory 45(7), 2271–2284 (1999)

  22. Ye, G.-B., Zhou, D.-X.: Fully online classification by regularization. Appl. Comput. Harmon. Anal. 23(2), 198–214 (2007)

  23. Ying, Y., Pontil, M.: Online gradient descent learning algorithms. Found. Comput. Math. 8(5), 561–596 (2008)

  24. Ying, Y., Zhou, D.-X.: Online regularized classification algorithms. IEEE Trans. Inf. Theory 52(11), 4775–4788 (2006)

  25. Ying, Y., Zhou, D.-X.: Learnability of Gaussians with flexible variances. J. Mach. Learn. Res. 8(Feb), 249–276 (2007)

Author information

Corresponding author

Correspondence to Ting Hu.

Additional information

Communicated by G. Kerkyacharian.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The work described in this paper is partially supported by the National Natural Science Foundation of China [Nos. 11671307 and 11571078], Natural Science Foundation of Hubei Province in China [No. 2017CFB523] and the Fundamental Research Funds for the Central Universities, South-Central University for Nationalities [No. CZY20012].

Appendices

Appendix A: Some Properties of RKHS with Gaussian Kernels

This section collects known results about the Gaussian kernel \(K_\sigma \) and its associated RKHS \(\mathcal H_\sigma \) that are useful for the proofs of our results.

Lemma A.1

[16] For any \(0<\sigma <\tau ,\)

$$\begin{aligned} \mathcal{H}_\tau \subset \mathcal{H}_\sigma . \end{aligned}$$

Moreover, for any \(f\in \mathcal{H}_\tau ,\)

$$\begin{aligned} \Vert f\Vert _{\sigma }\le \left( \frac{\tau }{\sigma }\right) ^{\frac{n}{2}}\Vert f\Vert _{\tau }. \end{aligned}$$

Lemma A.2

[12] Let \(\mathcal{H}_{\sigma }({\mathbb {R}}^n)\) be the RKHS with the Gaussian kernel \(K_\sigma (\cdot ,\cdot )\) on \({\mathbb {R}}^n\times {\mathbb {R}}^n.\) Then its associated norm is given for any \(f\in \mathcal{H}_{\sigma }({\mathbb {R}}^n)\) by

$$\begin{aligned} \Vert f\Vert ^2_{\mathcal{H}_{\sigma }({\mathbb {R}}^n)}=\frac{1}{(2\pi )^{\frac{3n}{2}}\sigma ^{n}}\int _{{\mathbb {R}}^n}| {{\hat{f}}}(w) |^2e^{\frac{\sigma ^2|w|^2}{2}}\mathrm{d}w \end{aligned}$$

where \({{\hat{f}}}(w)\) denotes the Fourier transform of the function f.
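As a sanity check on this formula (and on the constant in Lemma A.1), the following Python sketch evaluates the norm numerically in dimension \(n=1\) for the test function \(f(x)=e^{-x^2/(2\tau ^2)}=K_\tau (x,0).\) It assumes the Fourier convention \({{\hat{f}}}(w)=\int f(x)e^{-ixw}\mathrm{d}x\) (so that \({{\hat{f}}}(w)=\tau \sqrt{2\pi }\,e^{-\tau ^2w^2/2}\)); under this convention the formula returns \(\Vert K_\tau (\cdot ,0)\Vert _\tau =1,\) as the reproducing property requires, and the bound of Lemma A.1 can be observed directly. The function name and quadrature grid are illustrative only.

```python
import numpy as np

def rkhs_norm_sq(sigma, tau):
    """||f||_{H_sigma(R)}^2 via the formula of Lemma A.2 (n = 1) for f(x) = exp(-x^2/(2 tau^2))."""
    w = np.linspace(-40.0, 40.0, 200001)                       # quadrature grid; integrand decays fast
    dw = w[1] - w[0]
    f_hat_sq = 2.0 * np.pi * tau**2 * np.exp(-tau**2 * w**2)   # |f_hat(w)|^2 under the assumed convention
    integrand = f_hat_sq * np.exp(sigma**2 * w**2 / 2.0)
    return np.sum(integrand) * dw / ((2.0 * np.pi) ** 1.5 * sigma)

tau, sigma = 1.0, 0.5                                          # 0 < sigma < tau
norm_tau = np.sqrt(rkhs_norm_sq(tau, tau))                     # ~1 by the reproducing property
norm_sigma = np.sqrt(rkhs_norm_sq(sigma, tau))
print(norm_tau, norm_sigma, (tau / sigma) ** 0.5 * norm_tau)   # Lemma A.1: second value <= third
```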

Lemma A.3

[1] Let \(\mathcal{X}\) be a non-empty subset of \({\mathbb {R}}^n.\) Then

$$\begin{aligned} \mathcal{H}_\sigma :=\left\{ f={{\tilde{f}}}|_\mathcal{X}:{\tilde{f}} \in \mathcal{H}_{\sigma }({\mathbb {R}}^n)\right\} \quad \text {and}\quad \Vert f\Vert _\sigma =\inf \left\{ \Vert {{\tilde{f}}}\Vert _{\mathcal{H}_{\sigma }({\mathbb {R}}^n)}:{{\tilde{f}}}|_\mathcal{X}=f\right\} . \end{aligned}$$

Appendix B: Elementary Inequalities

Lemma B.1

For \(T\ge 3,\) the following elementary inequalities hold.

$$\begin{aligned} \sum _{j=1}^T j^{-\theta ^*}\le {\left\{ \begin{array}{ll}\frac{T^{1-\theta ^*}}{1-\theta ^*}, &{}\quad \text {if } 0<\theta ^*<1;\\ 2\log T,&{} \quad \text {if } \theta ^*=1;\\ \frac{\theta ^*}{\theta ^*-1},&{} \quad \text {if } \theta ^*>1. \end{array}\right. } \end{aligned}$$
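These are standard bounds; as a quick numerical illustration, the Python snippet below compares the partial sums with the stated bounds on a small, arbitrarily chosen grid of exponents \(\theta ^*\) and values of T.

```python
import math

def lhs(T, theta):
    """Partial sum of j^{-theta} for j = 1, ..., T."""
    return sum(j ** (-theta) for j in range(1, T + 1))

def rhs(T, theta):
    """Right-hand side of Lemma B.1 in the three regimes of theta."""
    if theta < 1:
        return T ** (1 - theta) / (1 - theta)
    if theta == 1:
        return 2 * math.log(T)
    return theta / (theta - 1)

for theta in (0.5, 1.0, 1.5):
    for T in (3, 10, 100, 1000):
        assert lhs(T, theta) <= rhs(T, theta), (theta, T)
print("Lemma B.1 bounds hold on the tested grid.")
```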

Lemma B.2

For \(T\ge 3,\)

$$\begin{aligned} \sum _{k=1}^{T-2}\frac{1}{k(k+1)}\sum _{t=T-k}^T (t-1)^{-1}\le 16T^{-1}(\log T). \end{aligned}$$

Proof

When \(1\le k\le \frac{T}{2},\) we have \(\sum _{t=T-k}^T (t-1)^{-1}\le (T-k-1)^{-1}(k+1)\le 6T^{-1}(k+1),\) and when \(\frac{T}{2}< k\le T-2,\) Lemma B.1 gives \(\sum _{t=T-k}^T (t-1)^{-1}\le 2\log T.\) Combining the two cases yields that

$$\begin{aligned}&\sum _{k=1}^{T-2}\frac{1}{k(k+1)}\sum _{t=T-k}^T (t-1)^{-1}\le 6T^{-1}\sum _{k=1}^{\frac{T}{2}}\frac{1}{k}+2\log T\sum _{k=\frac{T}{2}+1}^{T-1}\frac{1}{k(k+1)}\\&\quad \le 6T^{-1}\left( 2\log \frac{T}{2}\right) +4(\log T)T^{-1}\le 16T^{-1}(\log T). \end{aligned}$$

\(\square \)

Lemma B.3

We have for \(T\ge 3,\)

$$\begin{aligned} \sum _{k=1}^{T-2}\frac{(T-k-1)^{-1}}{k}\le 16 T^{-1}(\log T). \end{aligned}$$

Proof

When \(1\le k\le \frac{T}{2},\) \((T-k-1)^{-1}\le 6 T^{-1},\) and when \(\frac{T}{2}<k\le T-2,\) \(1\le T-k-1< \frac{T}{2}-1.\) Thus, using Lemma B.1, we have that

$$\begin{aligned}&\sum _{k=1}^{T-2}\frac{(T-k-1)^{-1}}{k}\le 6T^{-1}\sum _{k=1}^{\frac{T}{2}}\frac{1}{k}\\&\qquad +\sum _{k=\frac{T}{2}+1}^{T-2}\frac{(T-k-1)^{-1}}{k}\le 12 T^{-1}(\log T)+2T^{-1}\sum _{k=\frac{T}{2}+1}^{T-2}(T-k-1)^{-1}\\&\quad \le 12T^{-1}(\log T)+2T^{-1}\left( \sum _{i=1}^{\frac{T}{2}-2} i^{-1}\right) \le 16T^{-1}(\log T). \end{aligned}$$

\(\square \)

Lemma B.4

For any \(q^*>1,\) we have for \(T\ge 3,\)

$$\begin{aligned} \sum _{k=1}^{T-2}\frac{1}{k(k+1)}\sum _{t=T-k}^T (t-1)^{-q^*}\le c_{q^*} T^{-1} \end{aligned}$$

where \(c_{q^*}\) is a constant independent of T.

Proof

The proof, which uses Lemma B.1, is similar to that of Lemma B.2 and is omitted. \(\square \)
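The bounds of Lemmas B.2–B.4 can likewise be spot-checked numerically. The Python snippet below evaluates the three left-hand sides directly for a few values of T, compares the first two with \(16T^{-1}\log T,\) and prints T times the Lemma B.4 sum, which should remain bounded in T; the choice \(q^*=2\) is illustrative only.

```python
import math

def check(T, q_star=2.0):
    # Left-hand side of Lemma B.2
    b2 = sum(1.0 / (k * (k + 1)) * sum(1.0 / (t - 1) for t in range(T - k, T + 1))
             for k in range(1, T - 1))
    # Left-hand side of Lemma B.3
    b3 = sum(1.0 / ((T - k - 1) * k) for k in range(1, T - 1))
    # Left-hand side of Lemma B.4 with the illustrative exponent q_star
    b4 = sum(1.0 / (k * (k + 1)) * sum((t - 1.0) ** (-q_star) for t in range(T - k, T + 1))
             for k in range(1, T - 1))
    bound = 16.0 * math.log(T) / T
    return b2 <= bound, b3 <= bound, T * b4   # last entry should stay bounded as T grows

for T in (3, 10, 100, 1000):
    print(T, check(T))
```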

Appendix C: Useful Lemmas and Proofs in Sect. 4

The following lemma is key to estimating the approximation ability of \(\mathcal{H}_\sigma .\)

Lemma C.1

Let \(\mathcal{X}\) be the closed unit ball in \({\mathbb {R}}^n\) and \(\mathcal{{{\tilde{X}}}}=3\mathcal{X}.\) Let \(f: \mathcal{X}\rightarrow {\mathbb {R}}\) be a measurable function. Define an extension \({{\tilde{f}}} :\mathcal{{{\tilde{X}}}}\rightarrow {\mathbb {R}}\) of the function f by

$$\begin{aligned} {{\tilde{f}}} (x)= {\left\{ \begin{array}{ll}f(x), &{}\text {if}\quad x\in \mathcal{X};\\ f\left( \frac{x}{|x|}\right) ,&{}\text {if}\quad x\in \mathcal{{{\tilde{X}}}}\setminus \mathcal{X}. \end{array}\right. } \end{aligned}$$

Then for any \(x\in \mathcal{X}\) and \(\varepsilon >0,\) we have that

$$\begin{aligned} \inf _{u\in \mathcal{X}}\{|x-u|, s.t.|f(x)-f(u)|\ge \varepsilon \}=\inf _{v\in \mathcal{{{\tilde{X}}}}}\{|x-v|, s.t.|{\tilde{f}}(x)-{\tilde{f}}(v)|\ge \varepsilon \}. \end{aligned}$$

Proof

For \(x\in \mathcal{X}\) and \(v\in \mathcal{{{\tilde{X}}}}\setminus \mathcal{X},\) a simple calculation shows that \(\left| x-\frac{v}{|v|}\right| ^2< |x-v|^2.\)

For any \(v\in \mathcal{{{\tilde{X}}}}\setminus \mathcal{X},\) since \({{\tilde{f}}}(v)=f\left( \frac{v}{|v|}\right) ,\) we have \(|{\tilde{f}}(x)-{\tilde{f}}(v)|=\left| f(x)-f\left( \frac{v}{|v|}\right) \right| \) and

$$\begin{aligned}&\inf _{v\in \mathcal{{{\tilde{X}}}}\setminus \mathcal{X}}\{|x-v|, s.t.|{\tilde{f}}(x)-{\tilde{f}}(v)|\ge \varepsilon \}=\inf _{v\in \mathcal{{{\tilde{X}}}}\setminus \mathcal{X}}\left\{ |x-v|, s.t.\left| f(x)-f\left( \frac{v}{|v|}\right) \right| \ge \varepsilon \right\} \\&\quad \ge \inf _{v\in \mathcal{{{\tilde{X}}}}\setminus \mathcal{X}}\left\{ \left| x-\frac{v}{|v|}\right| , s.t.\left| f(x)-f\left( \frac{v}{|v|}\right) \right| \ge \varepsilon \right\} \ge \inf _{u\in \mathcal{X}}\{|x-u|, s.t.|f(x)-f(u)|\ge \varepsilon \}. \end{aligned}$$

Notice that, for any \(x\in \mathcal{X},\) either

$$\begin{aligned} \inf _{v\in \mathcal{{{\tilde{X}}}}}\{|x-v|, s.t.|{\tilde{f}}(x)-{\tilde{f}}(v)|\ge \varepsilon \}= \inf _{v\in \mathcal{{{\tilde{X}}}}\setminus \mathcal{X}}\{|x-v|, s.t.|{\tilde{f}}(x)-{\tilde{f}}(v)|\ge \varepsilon \} \end{aligned}$$

or

$$\begin{aligned} \inf _{v\in \mathcal{{{\tilde{X}}}}}\{|x-v|, s.t.|{\tilde{f}}(x)-{\tilde{f}}(v)|\ge \varepsilon \}=\inf _{u\in \mathcal{X}}\{|x-u|, s.t.|f(x)-f(u)|\ge \varepsilon \}. \end{aligned}$$

Combining the estimates above completes the proof. \(\square \)
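A one-dimensional illustration of this construction may be helpful: with \(\mathcal{X}=[-1,1]\) and \(\mathcal{{{\tilde{X}}}}=[-3,3],\) the extension reduces to \({{\tilde{f}}}(x)=f(\mathrm{sign}(x))\) for \(|x|>1,\) i.e. to clipping the argument to \([-1,1].\) The Python sketch below compares the two infima of Lemma C.1 on a discretization; the test function, grids, and values of \(\varepsilon \) are arbitrary choices for illustration, and the agreement is up to the grid resolution.

```python
import math
import numpy as np

def f(x):
    return np.sin(3.0 * x)              # arbitrary continuous test function on X = [-1, 1]

def f_tilde(x):
    return f(np.clip(x, -1.0, 1.0))     # radial extension of Lemma C.1 in dimension n = 1

def level_set_distance(x, eps, grid, g):
    """inf{ |x - v| : v in grid, |g(x) - g(v)| >= eps } (inf over an empty set is +inf)."""
    d = np.abs(x - grid)[np.abs(g(x) - g(grid)) >= eps]
    return d.min() if d.size else math.inf

X = np.linspace(-1.0, 1.0, 4001)        # grid on X
X_tilde = np.linspace(-3.0, 3.0, 12001) # grid on X_tilde with the same spacing
for x in (-0.9, 0.0, 0.7):
    for eps in (0.1, 0.5):
        print(x, eps,
              level_set_distance(x, eps, X, f),
              level_set_distance(x, eps, X_tilde, f_tilde))   # the two distances should agree
```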

Next, we will bound the error caused by the varying Gaussians.

Lemma C.2

Define \({{\tilde{f}}}_{\sigma _t}\) by (45) and \(f_{\sigma _t}:={{\tilde{f}}}_{\sigma _t}|_\mathcal{X}\). If the variances \(\{\sigma _t,t\in {\mathbb {N}}\}\) are chosen as in Lemma 1 with \(0<\beta <1,\) then

$$\begin{aligned} \Vert f_{\sigma _t}\Vert _{\sigma _t}&\le \frac{1}{(\sqrt{2\pi })^{n/2}}\Vert {\tilde{f}}^\phi _\rho \Vert _{L^2({\mathbb {R}}^n)}t^{n\beta /2}, \end{aligned}$$
(50)

and

$$\begin{aligned} \Vert f_{\sigma _{t}}-f_{\sigma _{t-1}}\Vert _{\sigma _t}\le \frac{2c_\beta +c_\beta ^2}{\sqrt{2}(\sqrt{2\pi })^{n/2}}\Vert {\tilde{f}}^\phi _\rho \Vert _{L^2({\mathbb {R}}^n)}t^{-1+\frac{n\beta }{2}} \end{aligned}$$
(51)

where \(c_\beta \) is given in the proof of Lemma 1.

Proof

Notice that the Fourier transform \(\hat{{\tilde{f}}}_{\sigma }(w)=\hat{{\tilde{K}}}_\sigma (w)\hat{{\tilde{f}}}^\phi _\rho (w)=\exp \left\{ -\frac{\sigma ^2|w|^2}{2}\right\} \hat{{\tilde{f}}}^\phi _\rho (w).\) With Lemmas A.2 and A.3, following a procedure similar to the proof of Lemma 1, the conclusions (50) and (51) can be obtained. \(\square \)
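The first identity in this proof says that \({{\tilde{f}}}_\sigma \) is obtained from \({{\tilde{f}}}^\phi _\rho \) by applying the Fourier multiplier \(e^{-\sigma ^2|w|^2/2},\) i.e. by convolving with a normalized Gaussian. The short Python check below verifies this multiplier by quadrature in dimension \(n=1,\) under the assumed convention \({{\hat{f}}}(w)=\int f(x)e^{-ixw}\mathrm{d}x;\) the grid and the value of \(\sigma \) are arbitrary.

```python
import numpy as np

sigma = 0.7
x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]
# normalized Gaussian (2 pi sigma^2)^{-1/2} exp(-x^2 / (2 sigma^2))
g = np.exp(-x**2 / (2.0 * sigma**2)) / np.sqrt(2.0 * np.pi * sigma**2)
for w in (0.0, 0.5, 1.0, 2.0):
    g_hat = np.sum(g * np.exp(-1j * x * w)) * dx          # quadrature of int g(x) e^{-ixw} dx
    print(w, g_hat.real, np.exp(-sigma**2 * w**2 / 2.0))  # the two values should agree
```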

With the help of the above lemmas, we can prove our convergence rate in Sect. 4.

Proof of Theorem 5

We prove the conclusion by means of Theorem 1. By (51) and (46), we have

$$\begin{aligned} \mathcal{A}_t\le \mathcal{A}_1t^{-\frac{\zeta \beta }{1+\zeta }}\quad \text {and}\quad \mathcal{B}_t\le \mathcal{B}_1 t^{-1+\frac{n\beta }{2}} \end{aligned}$$

with \(\mathcal{A}_1=C_{n,\zeta ,q}\) and \(\mathcal{B}_1=\frac{2c_\beta +c_\beta ^2}{\sqrt{2}(\sqrt{2\pi })^{n/2}}\Vert {\tilde{f}}^\phi _\rho \Vert _{L^2({\mathbb {R}}^n)}.\) Taking \(\tau =\frac{1+\epsilon }{1-\frac{n\beta }{2}},\) by (50) and (31), we have that

$$\begin{aligned} {\mathrm{I}\mathrm{E}}_{z_1,\ldots ,z_t}\left[ \Vert f_{t+1}\Vert ^2_{\sigma _t}\right] \le C^{**} t^{1-\min \{\theta +\frac{\zeta \beta }{1+\zeta },1-n\beta -\epsilon \}} \end{aligned}$$

where \(C^{**}=4C_{\tau ,q,\eta }\left( 1+\frac{2\eta \mathcal{A}_1+\mathcal{B}_1^\tau +\mathcal{B}_1^{2-\tau }}{\max \{1-\theta -\zeta \beta (1+\zeta )^{-1},n\beta +\epsilon \}} +\frac{q^*}{q^*-1}\frac{\Vert {\tilde{f}}^\phi _\rho \Vert ^2_{L^2({\mathbb {R}}^n)}}{(\sqrt{2\pi })^{n}} \right) +\frac{2\Vert {\tilde{f}}^\phi _\rho \Vert ^2_{L^2({\mathbb {R}}^n)}}{(\sqrt{2\pi })^{n}}.\) Putting the estimates above into (7) and following the same proof procedure as in Theorem 2, we obtain the desired conclusion with

$$\begin{aligned} C_3'= \frac{2{\tilde{C}}\max \Bigg \{\frac{\Vert {\tilde{f}}^\phi _\rho \Vert ^2_{L^2({\mathbb {R}}^n)}}{(\sqrt{2\pi })^{n}},C^{**},\eta \mathcal{A}_1,\mathcal{B}_1^{2-\tau } \Bigg \}}{\max \{1-\theta -\zeta \beta (1+\zeta )^{-1},n\beta +\epsilon \}}. \end{aligned}$$

\(\square \)

Cite this article

Wang, B., Hu, T. Unregularized Online Algorithms with Varying Gaussians. Constr Approx 53, 403–440 (2021). https://doi.org/10.1007/s00365-021-09536-3
