Abstract
Gaussians are a family of Mercer kernels which are widely used in machine learning and statistics. The variance of a Gaussian kernel reflects the specific structure of the reproducing kernel Hilbert space (RKHS) induced by the Gaussian, as well as other important features of learning problems such as the frequency of function components. As the variance of the Gaussian decreases, the learning performance and the approximation ability improve. This paper introduces an unregularized online algorithm with decreasing Gaussian variances, in which no regularization term is imposed and the samples are presented in sequence. With appropriate step sizes, concrete learning rates are derived under smoothness assumptions on the target function, which are used to bound the approximation error. Additionally, a new type of geometric noise condition is proposed to estimate the approximation error without any smoothness assumption. It is more general than the condition in Steinwart et al. (Ann Stat 35(2):575–607, 2007), which is suitable only for the hinge loss. An essential step is to bound the difference between the approximation functions generated by RKHSs with varying Gaussian variances. The Fourier transform plays a crucial role in our analysis.
References
Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68(3), 337–404 (1950)
Cesa-Bianchi, N., Conconi, A., Gentile, C.: On the generalization ability of on-line learning algorithms. IEEE Trans. Inf. Theory 50(9), 2050–2057 (2004)
Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for support vector machines. Mach. Learn. 46(1–3), 131–159 (2002)
Chen, D.R., Wu, Q., Ying, Y., Zhou, D.-X.: Support vector machine soft margin classifiers: error analysis. J. Mach. Learn. Res. 5, 1143–1175 (2004)
Dieuleveut, A., Bach, F.: Nonparametric stochastic approximation with large step-sizes. Ann. Stat. 44(4), 1363–1399 (2016)
Eberts, M., Steinwart, I.: Optimal learning rates for least squares SVMs using Gaussian kernels. In: Advances in Neural Information Processing Systems, pp. 1539–1547 (2011)
Evgeniou, T., Pontil, M.: Regularized multi-task learning. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 109–117. ACM (2004)
Hang, H., Steinwart, I.: Optimal learning with anisotropic Gaussian SVMs. arXiv preprint arXiv:1810.02321 (2018)
Lanckriet, G.R., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res. 5(Jan), 27–72 (2004)
Lei, Y., Shi, L., Guo, Z.-C.: Convergence of unregularized online learning algorithms. J. Mach. Learn. Res. 18(1), 6269–6301 (2017)
Lin, J., Zhou, D.-X.: Online learning algorithms can converge comparably fast as batch learning. IEEE Trans. Neural Netw. Learn. Syst. 29(6), 2367–2378 (2018)
Matache, M.T., Matache, V.: Hilbert spaces induced by Toeplitz covariance kernels. In: Pasik-Duncan, B. (ed.) Stochastic Theory and Control, pp. 319–333. Springer, Berlin (2002)
Rakhlin, A., Panchenko, D., Mukherjee, S.: Risk bounds for mixture density estimation. ESAIM Probab. Stat. 9, 220–229 (2005)
Smale, S., Yao, Y.: Online learning algorithms. Found. Comput. Math. 6(2), 145–170 (2006)
Smale, S., Zhou, D.-X.: Estimating the approximation error in learning theory. Anal. Appl. 1(01), 17–41 (2003)
Steinwart, I., Hush, D., Scovel, C.: An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels. IEEE Trans. Inf. Theory 52(10), 4635–4643 (2006)
Steinwart, I., Scovel, C.: Fast rates for support vector machines using Gaussian kernels. Ann. Stat. 35(2), 575–607 (2007)
Tsybakov, A.B.: Optimal aggregation of classifiers in statistical learning. Ann. Stat. 32(1), 135–166 (2004)
Wu, Q., Ying, Y., Zhou, D.-X.: Multi-kernel regularized classifiers. J. Complex. 23(1), 108–134 (2007)
Xiang, D.-H., Zhou, D.-X.: Classification with Gaussians and convex loss. J. Mach. Learn. Res. 10(Jul), 1447–1468 (2009)
Yang, Y.: Minimax nonparametric classification. I. Rates of convergence. IEEE Trans. Inf. Theory 45(7), 2271–2284 (1999)
Ye, G.-B., Zhou, D.-X.: Fully online classification by regularization. Appl. Comput. Harmon. Anal. 23(2), 198–214 (2007)
Ying, Y., Pontil, M.: Online gradient descent learning algorithms. Found. Comput. Math. 8(5), 561–596 (2008)
Ying, Y., Zhou, D.-X.: Online regularized classification algorithms. IEEE Trans. Inf. Theory 52(11), 4775–4788 (2006)
Ying, Y., Zhou, D.-X.: Learnability of Gaussians with flexible variances. J. Mach. Learn. Res. 8(Feb), 249–276 (2007)
Additional information
Communicated by G. Kerkyacharian.
The work described in this paper is partially supported by the National Natural Science Foundation of China [Nos. 11671307 and 11571078], Natural Science Foundation of Hubei Province in China [No. 2017CFB523] and the Fundamental Research Funds for the Central Universities, South-Central University for Nationalities [No. CZY20012].
Appendices
Some Properties of RKHS with Gaussian Kernels
This section presents known results about the Gaussian kernel \(K_\sigma \) and its associated RKHS \(\mathcal H_\sigma \) that are useful for the proofs of our results.
Lemma A.1
[16] For any \(0<\sigma <\tau ,\)
Moreover, for any \(f\in \mathcal{H}_\tau ,\)
Lemma A.2
[12] Let \(\mathcal{H}_{\sigma }({\mathbb {R}}^n)\) be the RKHS with the Gaussian kernel \(K_\sigma (\cdot ,\cdot )\) on \({\mathbb {R}}^n\times {\mathbb {R}}^n.\) Then its associated norm is given for any \(f\in \mathcal{H}_{\sigma }({\mathbb {R}}^n)\) by
where \({{\hat{f}}}(w)\) denotes the Fourier transform of the function f.
Lemma A.3
[1] Let \(\mathcal{X}\) be a non-empty subset of \({\mathbb {R}}^n.\) Then
Elementary Inequalities
Lemma B.1
For \(T\ge 3,\) the following elementary inequalities hold.
Lemma B.2
For \(T\ge 3,\)
Proof
When \(1\le k\le \frac{T}{2},\) we have \(\sum _{t=T-k}^T (t-1)^{-1}\le (T-k-1)^{-1}(k+1)\le 6T^{-1}(k+1),\) and when \(\frac{T}{2}< k\le T-2,\) Lemma B.1 gives \(\sum _{t=T-k}^T (t-1)^{-1}\le 2\log T.\) Combining the two cases yields that
\(\square \)
Lemma B.3
We have for \(T\ge 3,\)
Proof
When \(1\le k\le \frac{T}{2},\) \((T-k-1)^{-1}\le 6 T^{-1}\) and when \(\frac{T}{2}<k<T,\) \(1\le T-k-1< \frac{T}{2}-1.\) Thus, using Lemma B.1, we have that
\(\square \)
Lemma B.4
For any \(q^*>1,\) we have for \(T\ge 3,\)
where \(c_{q^*}\) is a constant independent of T.
Proof
Using Lemma B.1, the proof is similar to that of Lemma B.2 and is therefore omitted. \(\square \)
Useful Lemmas and Proofs in Sect. 4
The following lemma is key to estimating the approximation ability of \(\mathcal{H}_\sigma .\)
Lemma C.1
Let \(\mathcal{X}\) be the closed unit ball in \({\mathbb {R}}^n\) and \(\mathcal{{{\tilde{X}}}}=3\mathcal{X}.\) Let \(f: \mathcal{X}\rightarrow {\mathbb {R}}\) be a measurable function. Define an extension \({{\tilde{f}}} :\mathcal{{{\tilde{X}}}}\rightarrow {\mathbb {R}}\) of f by
Then for any \(x\in \mathcal{X}\) and \(\varepsilon >0,\) we have that
Proof
For any \(x\in \mathcal{X}\) and \(v\in \mathcal{{{\tilde{X}}}}\setminus \mathcal{X},\) a simple calculation shows that \(\left| x-\frac{v}{|v|}\right| ^2< |x-v|^2:\) writing \(v=|v|u\) with \(|u|=1,\) we have \(|x-v|^2-|x-u|^2=(|v|-1)\left( |v|+1-2\langle x,u\rangle \right) >0\) since \(|v|>1\) and \(|x|\le 1.\)
For any \(v\in \mathcal{{{\tilde{X}}}}\setminus \mathcal{X},\) since \({{\tilde{f}}}(v)=f\left( \frac{v}{|v|}\right) ,\) we have \(|{\tilde{f}}(x)-{\tilde{f}}(v)|=\left| f(x)-f\left( \frac{v}{|v|}\right) \right| \) and
Notice the fact that for any \(x\in \mathcal{X},\)
or
Combining the estimates above completes the proof. \(\square \)
Next, we will bound the error caused by the varying Gaussians.
Lemma C.2
Define \({{\tilde{f}}}_{\sigma _t}\) by (45) and \(f_{\sigma _t}:={{\tilde{f}}}_{\sigma _t}|_\mathcal{X}\). If the variances \(\{\sigma _t,t\in {\mathbb {N}}\}\) decrease polynomially with exponent \(0<\beta <1,\) then
and
where \(c_\beta \) is given in the proof of Lemma 1.
Proof
Notice that the Fourier transform satisfies \(\hat{{\tilde{f}}}_{\sigma }(w)=\hat{{\tilde{K}}}_\sigma (w)\hat{{\tilde{f}}}^\phi _\rho (w)=\exp \left\{ -\frac{\sigma ^2|w|^2}{2}\right\} \hat{{\tilde{f}}}^\phi _\rho (w).\) With Lemmas A.2 and A.3, by a procedure similar to the proof of Lemma 1, the conclusions (50) and (51) follow. \(\square \)
With the help of the above lemmas, we can prove our convergence rate in Sect. 4.
Proof of Theorem 5
We shall prove the conclusions of Theorem 5 by means of Theorem 1. By (51) and (46), we have
with \(\mathcal{A}_1=C_{n,\zeta ,q}\) and \(\mathcal{B}_1=\frac{2c_\beta +c_\beta ^2}{\sqrt{2}(\sqrt{2\pi })^{n/2}}\Vert {\tilde{f}}^\phi _\rho \Vert _{L^2({\mathbb {R}}^n)}.\) Taking \(\tau =\frac{1+\epsilon }{1-\frac{n\beta }{2}},\) by (50) and (31), we have that
where \(C^{**}=4C_{\tau ,q,\eta }\left( 1+\frac{2\eta \mathcal{A}_1+\mathcal{B}_1^\tau +\mathcal{B}_1^{2-\tau }}{\max \{1-\theta -\zeta \beta (1+\zeta )^{-1},n\beta +\epsilon \}} +\frac{q^*}{q^*-1}\frac{\Vert {\tilde{f}}^\phi _\rho \Vert ^2_{L^2({\mathbb {R}}^n)}}{(\sqrt{2\pi })^{n}} \right) +\frac{2\Vert {\tilde{f}}^\phi _\rho \Vert ^2_{L^2({\mathbb {R}}^n)}}{(\sqrt{2\pi })^{n}}.\) Putting the estimates above into (7), following the same proof procedure of Theorem 2, we can get the desired conclusion with
\(\square \)
Cite this article
Wang, B., Hu, T. Unregularized Online Algorithms with Varying Gaussians. Constr Approx 53, 403–440 (2021). https://doi.org/10.1007/s00365-021-09536-3
Keywords
- Online learning
- Varying Gaussian kernels
- Reproducing kernel Hilbert spaces
- Geometric noise condition
- Learning rate